ABSTRACT

Title of dissertation: DETERMINANTS OF COLLEGE GRADE POINT AVERAGES
Paul Dean Bailey, Doctor of Philosophy, 2012

Dissertation directed by: Professor Judith Hellerstein, Department of Economics; Professor John Wallis, Department of Economics

Chapter 2: The Role of Class Difficulty in College Grade Point Averages.

Grade Point Averages (GPAs) are widely used as a measure of college students' ability. Low GPAs can remove a student from eligibility for scholarships, and even continued enrollment at a university. However, GPAs are determined not only by student ability but also by the difficulty of the classes the students take. When class difficulty is correlated with student ability, GPAs are biased estimates of students' abilities. Using a fixed effects model on eight years of transcript data from one university, with one fixed effect for student ability and another for class difficulty, I decompose grades at the individual student-class level and find that GPAs are largely not biased. Eighty percent of the variation in GPAs is explained by student ability, while only three percent of the variation in GPAs is explained by class difficulty. This estimation is carried out using an ordered logit estimator to account for the ordered but non-cardinal nature of grades.

Chapter 3: Are Low Income Students Diamonds in the Rough?

Consider two students who earn the same SAT score, one from a lower-income household and the other from a higher-income household. Since educational expense is a normal good, the lower income student will, on average, have had a less well-resourced primary and secondary education. The lower income student may therefore be stronger than their higher income counterpart because they have earned an equally high SAT score despite a lower quality pre-collegiate environment. If this is the case, once the two students start attending the same college, and school spending becomes more similar, the lower income student's in-college performance should be relatively higher. I test this theory by using eight years of data from one university to compare the grade point averages of students from various family income levels. Results show that lower income students appear to be diamonds in the rough: lower income students have surprisingly high outcomes, conditional on their SAT scores. However, unconditional on SAT score, the lower income students also outperform their higher income counterparts. This suggests that a single university's data is inappropriate for answering this question. I also develop how this type of regression might give insight into the production function of human capital. Specifically, a common assumption made in the economics of education literature is that first-differenced human capital accumulation rates are independent of ability because ability is already represented in the test used as a base period. A "diamonds in the rough" result would contradict that assumption, and show that SAT is not a perfect measure of underlying ability.

Chapter 4: Estimation of Large Ordered Multinomial Models.

Decomposing grades data into class fixed effects and student fixed effects is difficult, and the estimator's accuracy is unknown. I describe the successful application of the L-BFGS algorithm for fitting these data and propose a new convergence criterion. I also show that when the number of classes is about 32 (slightly fewer than is typical at the University of Maryland), the estimator performs well at estimating correlations and the non-parametric statistics used in Chapter 2 of this dissertation.
Some issues with significance testing the sets of fixed effects are also considered, and I show that when the number of classes is 32, the significance tests are not sufficiently protective against false rejection of the null hypothesis. The jackknifed likelihood ratio test is shown to be only modestly biased towards false rejection regardless of the number of classes per student.

DETERMINANTS OF COLLEGE GRADE POINT AVERAGES

by Paul Dean Bailey

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2012

Advisory Committee:
Professor Judith Hellerstein, Co-Chair/Advisor
Professor John Wallis, Co-Chair/Advisor
Professor John Haltiwanger
Professor John Chao
Professor Paul Hanges

© Copyright by Paul Dean Bailey 2012

Dedication

This thesis is dedicated to my father, Dr. W. David Bailey.

Acknowledgments

I treasure the substantial support from the University of Maryland community and within my family that made this thesis possible.

I am forever indebted to John Wallis, who taught me how to identify and write an economics paper. In this formidable task, he was extremely generous with his time and patience. I appreciate the assistance of Judith Hellerstein for showing me how a labor economist thinks about a paper. This is a skill for which I will always be grateful and hope to put to good use throughout my career. I am also grateful for several important conversations with John Haltiwanger at key moments that provided useful guidance about which aspects of my research were actually interesting to an outside reader. I also appreciate the time John Chao took with me in identifying state-of-the-art econometric techniques and providing a sounding board for my direction of inquiry for the fourth chapter. Thanks are also due to Paul Hanges for serving on my thesis committee, taking the time to give my thesis a careful reading, and providing thoughtful responses.

I also appreciate the assistance of Kyland Howard and the Office of Institutional Research, Planning and Assessment at the University of Maryland, who generously gave substantial time and energy when providing me with the data used in this thesis.

A special thanks is also due to Vickie Fletcher, who was always happy to help and watch out for me. She is an example of everything that a graduate studies coordinator can possibly be and so much more.

My colleagues and friends also provided helpful advice, support, and levity as the case required. I am indebted to Abby Alpert, Juan Bonilla-Angel, Teresa Fort, Carolina Gonzalez-Velosa, Aaron Szott, and many others.

A million thanks are due to my wife, Louise, who never failed to give me time and space to work on any part of my graduate degree and provided love and support without fail at every turn. I also appreciate the understanding of my son, Eli, and daughter, Keira, at times when I was busy working instead of enjoying their company.

Table of Contents

List of Figures
List of Abbreviations
1 Determinants of College GPAs
2 The Role of Class Difficulty in College Grade Point Averages
  2.1 Introduction
  2.2 Correlations
  2.3 Data
  2.4 Results
    2.4.1 Decomposing grades using the ordered multinomial estimator
    2.4.2 The ability expansion path
    2.4.3 Variation in class difficulty
    2.4.4 Robustness to other regressors
    2.4.5 Robustness to estimator
    2.4.6 Robustness to sample selection criteria
  2.5 Conclusion
3 Are Low Income Students Diamonds in the Rough?
  3.1 Introduction
  3.2 Data
  3.3 Results
  3.4 Discussion and Conclusion
4 Estimation of Large Ordered Multinomial Models
  4.1 Simulation of estimators
  4.2 Significance testing
    4.2.1 Simulation
  4.3 Computation
    4.3.1 Convergence
  4.4 Discussion and Conclusion
A Data

List of Figures

2.1 Ability expansion path
2.2 Histogram of average grade awarded by departments
2.3 Histogram of student GPAs
2.4 Histogram of average grade awarded by class
2.5 Observed ability expansion path
2.6 Raw GPA versus average class difficulty
2.7 Ability expansion path with various additional regressors
2.8 Ability expansion path from OME and OLS estimators
2.9 Ability expansion path from the three datasets
4.1 The relationship between ability, observed ability, and grades
4.2 Type of variation the estimator allows between classes
4.3 An example of the likelihood of a particular grade given observed ability
4.4 Simulated performance expansion path for the "uncor" simulation
4.5 Simulated performance expansion path for the "cor" simulation
4.6 Simulated performance expansion path for the "High GPA" simulation

List of Abbreviations

BFGS    a numerical maximization/minimization algorithm
GPA     Grade Point Average
IRB     Institutional Review Board
L-BFGS  a limited-memory variant of BFGS
OLS     Ordinary Least Squares
OME     Ordinal Multinomial Estimator
STEM    Science, Technology, Engineering, and Mathematics
OIRPA   Office of Institutional Research, Planning and Assessment

Chapter 1
Determinants of College GPAs

An enormous amount of human capital is developed in college, with significant implications for the labor market. One of the most agreed upon empirical results in economics is that a year of education raises wages by about 7%, so that four years of college cumulatively represent an approximately 30% increase in a person's earning ability.
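As a quick check of this compounding arithmetic (my illustration, not a calculation from the dissertation): with a 7% return compounding over four years,

(1 + 0.07)^4 = 1.07^4 ≈ 1.31,

a cumulative increase of roughly 31%, consistent with the approximately 30% figure above. (The simpler additive approximation, 4 × 7% = 28%, gives a similar number.)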
Yet surprisingly little is known about how to measure human capital development in college. The most obvious and typical measurement is the grade point average (GPA), an average of grades assigned in each class. Since students are enrolled in different classes and grading schemes may be subjective, it seems prudent to wonder whether GPAs are a good measure of human capital.

In addition, college is often thought of as a single "treatment" or a shared experience across students. If there are substantial differences in the average class difficulty across student ability levels, that would imply that college offers not one single treatment, but rather different treatments for different students. Intuitively, higher ability students might be expected to take more difficult classes with more rigorous grading schemes. One might also suspect that students in harder classes develop more human capital, so that high ability students trade off higher human capital accumulation for lower grades compared to what they would have earned in easier classes. This fits into a "pricing" model developed by Freeman (1999) where departments that imbue low human capital development pay for (attract) their students with high grades.[1] The marginal value (in terms of GPA) of a higher grade earned in an easier class drops for students near the top of the GPA distribution; students who typically get a 'C' or a 'B' have the possibility of getting an 'A', while students who typically get an 'A' have no possibility of getting a higher grade. The grade payoffs of easier classes are therefore of less value to these higher ability students.[2] If each department wants students and can select the difficulty of its classes, and if students like easier classes, the Nash equilibrium is concerning: everyone gives out high grades. This type of behavior has even been observed (Eaton & Eswaran, 2008).

[Footnote 1] This model has been proposed by others as well (Drew, 2011).
[Footnote 2] This model is like one that is often offered where STEM (science, technology, engineering and math) classes are socially desirable but fail to attract a sufficient number of students. As an example, President Obama has pushed for an increase in the number of STEM classes, presumably with the belief that the human capital developed in these classes is more beneficial to society than that in other classes.

There are proposals to standardize post-collegiate testing to allow for a shared measure of college output. Similar to standardized testing at pre-collegiate levels, such testing would be common across schools, and therefore more objective. But this ignores that college is a time when students learn different things based on their coursework. A student who does not demonstrate significant improvement in a shared skill like critical thinking during college could have learned specific skills orthogonal to this. For example, knowing accounting standards need not increase one's critical thinking ability but may still be valuable human capital to acquire. The best test of what is taught in a class is the assessment of the faculty who know the class best, and the only measure we have of this is the grade. Simply put, grades are the only quantitative measure of college performance that captures its diversity of subject material, and are therefore always an important outcome to include in analysis. To make the most effective use of GPAs in research, one must ask: what information is carried in a GPA and how can it be used?
To answer this question I decompose grades into a student fixed effect and a class fixed effect. This gives an estimate of student ability controlling for the classes the student is enrolled in, and an estimate of class difficulty controlling for the ability of the students who enroll in the class.[3] Each student then has an estimated ability, and it is easy to estimate the average difficulty of the classes the student enrolled in (the student's average class difficulty). As described earlier, one might suspect that higher ability students are taking more difficult classes. One might also suspect that there is a grade "sweet spot," which could lead to students of all ability levels taking classes that they expect will lead to the same GPAs.

[Footnote 3] Student ability might be better thought of as student output, and class difficulty might be better thought of as anything that attenuates the function that maps the output of a student to the grade the student receives.

Chapter 2 of this dissertation measures the association between student ability and average class difficulty using correlations. These results show that student ability and average class difficulty are not strongly associated. A second set of correlations of student ability and class difficulty with GPAs shows that the correlation between average class difficulty and GPA is small and the correlation between student ability and GPA is high. GPAs are driven by the ability of the student, not the difficulty of the classes they take.

The association of these three variables is also measured by pooling students with approximately equal ability levels and then graphing the average, within each group, of students' GPAs versus their average class difficulty. This gives a plot of the typical association between average class difficulty and GPA for each ability level; it allows any form of joint distribution and is therefore more general than correlations, which assume that the joint distribution is bivariate normal. These plots also show a weak association between student ability and class difficulty, with most of the positive association between the two variables coming from the top and bottom of the ability distribution. These results suggest that the basic intuition about grades being biased measures of student ability is wrong; higher ability students do not, in general, take significantly harder classes.

Based on the results of chapter 2, it is reasonable to wonder if an analysis of student performance should use GPAs themselves or if more insight can be gained by using student fixed effects, which net out average class difficulty. Average class difficulty is not the main driver of student grades, so it may simply be unimportant. The difference between GPAs and student fixed effects is important; GPAs are easier to calculate and are available in more datasets than the transcript data required to calculate student fixed effects.

Chapter 3 considers whether family income plays a role in students' GPAs, conditional on SAT scores. The question this chapter asks is: if two students earn the same SAT score, does the environment where they achieved that score matter? It could be that SAT fully captures the ability of these students to succeed in college.
An alternative theory is that students from lower income families have been disadvantaged, had to work harder to achieve the same SAT, and will therefore outperform their higher income counterparts when they arrive at college and the playing field is leveled.[4] Running this regression on the University of Maryland data, I show that, conditional on SAT scores, higher income students receive lower grades than lower income students at the University of Maryland. However, on the same data, it is also true that, unconditionally, higher income students receive lower grades than lower income students. This suggests that the result is hardly surprising: low income students at the University of Maryland are simply diamonds.

[Footnote 4] This thought is considered by Mankiw (2011), who suggests essentially the specification used in chapter 3. He later added a note from Stinebrickner, who references his own article as saying the exact opposite occurs at a highly subsidized college (Stinebrickner & Stinebrickner, 2003).

The analysis in chapter 3 is undertaken twice, once using GPAs themselves as a measure of student output, and again using the estimated ability level of each student from the decomposition of grades. Thus the analysis applies the question of the distinction between GPAs and student ability to a real world problem and answers the question of whether the student fixed effects give different results for an analysis of grades data. The result is that GPA and student ability give the same answer.

However, one specification in this chapter informs one of the larger questions of this dissertation. A regression that breaks SAT down into math and verbal components shows that a one point gain in SAT math is associated with a larger increase in student ability than a one point increase in SAT verbal. At the same time, a one point increase in SAT math is associated with the same increase in GPA as a one point increase in SAT verbal. For example, given a student with a total SAT score of 1000, a breakdown of their SAT into math and verbal components is irrelevant to a prediction of their GPA, but that breakdown would help predict student ability. Another regression explains this: students who get higher SAT math scores balance out their higher student ability by taking approximately equally more difficult classes.

Chapter 4 addresses several econometric concerns regarding the decomposition of grades into student and class fixed effects. Two estimators are considered, ordered multinomial estimators (OME) and ordinary least squares (OLS). The OME has the advantage of describing the data generating process in a satisfying way, but the performance of the estimator is not well understood, while OLS does not describe the grading process well but should be accurate conditional on its assumptions. A simulation is used to show how these two estimators might perform when estimating the type of statistic used in chapter 2. The simulation shows that both estimators perform reasonably well.

Another issue is how to fit these fixed effects in the non-linear model (the OME). The problem is that the typical methods of solving these models involve forming a huge matrix that must then be inverted. I apply the method of Liu and Nocedal (1989). This method has the advantage that it scales linearly in the number of regressors, both in terms of storage and the computational intensity of an individual step.

Finally, chapter 4 considers the role of bias in estimating significance tests on the fixed effects.
Both the jackknife estimator for significance tests and the traditional likelihood ratio test are shown to be reasonably accurate at performing the significance tests when the number of observations per student is large. When the number of observations per student is small, the jackknife remains reasonably accurate.

Many previous papers have suggested that ideally one would estimate the effect of class and student while controlling for the other (Grove & Wasserman, 2003; Eaton & Eswaran, 2008). This type of analysis has been undertaken before on a smaller dataset using a linear model for grades (Arcidiacono et al., 2011), but the relationships of student ability and class difficulty were not the focus. This dissertation describes the results of such an analysis, shows that the analysis is accurately estimated, describes the relationship between the class and student fixed effects, and uses the student ability and class difficulty estimates to investigate a question. The results show that class difficulty matters less than one might think, but the intuition that better students take harder classes was not entirely false. Students with relatively higher math scores do take harder classes, though the same does not hold for relatively higher verbal scores.

Chapter 2
The Role of Class Difficulty in College Grade Point Averages

2.1 Introduction

Grade point averages (GPAs) are often used as a measure of college students' ability, by the university awarding them and by others judging a student's performance while at the university. GPAs, however, are determined not only by student ability but also by the difficulty of the classes the student takes, calling into question their status as a measure of ability. While there are many ways that student ability and class difficulty could be related, one that makes intuitive sense is that high ability students take harder classes than low ability students, attenuating their GPA gains. In principle, higher-ability students could even take such difficult classes that their GPAs would be lower than those of lower-ability students. Under those circumstances, GPA would be an extremely misleading measure of ability. Determining the role of class difficulty in college GPAs is therefore crucial to assessing the use of GPAs as a measure of ability.

To find the answer to this question, I estimate a model with one set of fixed effects for students and another set for classes using a large dataset of eight years of transcript data from the University of Maryland.[1] Using correlation coefficients on the fixed effects, I come to the perhaps surprising result that GPAs are a good measure of student ability. In fact, class difficulty and student ability are only slightly correlated. Moreover, student ability explains 80% of the variation in GPA while the average difficulty of classes in which a student is enrolled explains only 3% of the variation in GPAs.

[Footnote 1] I apply the term "ability" to student fixed effects and "difficulty" to class fixed effects. Investigating the exact nature of these variables is not a topic of this paper, so the terms are used despite the semantic ambiguity. What I am calling "ability" might be a product of some innate capacity to learn as well as student effort. Similarly, class difficulty could also be a measure of the quality of pedagogy in each class: more "difficult" classes would simply have poorer instruction.
This result is unexpected because previous research showed that average grade varies by department at other universities and colleges (Sabot & Wakeman-Linn, 1991; Freeman, 1999; Eaton & Eswaran, 2008), a finding I reproduce for the University of Maryland data that I have (Figure 2.2). At the same time, Sabot and Wakeman-Linn show that average SAT score does not vary by department. These two facts suggest that there is both variation in each class's difficulty as well as variation in each student's average class difficulty (the difficulty of the classes a student is enrolled in, averaged over the student's transcript). I am able to quantify the extent of this and find that variation in individual grades is approximately equally well predicted by the difficulty of the class and the ability of the student. However, for GPA to be a biased estimate of student ability, it must also be the case that students' average class difficulty is correlated with the students' ability; for example, if higher-ability students are systematically enrolled in more difficult classes. I find that, surprisingly, this third condition is not present at the University of Maryland and, for this reason, GPAs are not badly biased as measures of student ability.

I account for grades being a limited dependent variable (an 'A' is better than a 'B', but how much better is not obvious) by using an ordered logit estimator. The consistency/bias properties of the ordered logit estimators are not well known for a fixed effects model, so chapter 4 of this dissertation explores how these estimators perform with grades data using simulations. The simulations show that the ordered multinomial estimators do a good job of estimating the statistics used in the results of this paper and perform better than the OLS estimator.

Fitting such a fixed effects model requires large amounts of data because a greater number of observations per student and class improves the estimator's performance. Because of this I chose a lengthy time frame of administrative data from a large state university. In the fitted regression, an observation is a grade that a student receives in a particular class. There are five hundred thousand observations over eight years and nine thousand students with, on average, forty classes on their transcripts.

Because the correlations used to arrive at the main result assume a joint distribution that is bivariate normal between the two variables being compared, I develop the concept of an ability expansion path to describe the relationship between class difficulty and student ability more thoroughly. Specifically, I group students by estimated ability level and then plot their average GPA versus their estimated average class difficulty (Figure 2.1 provides an example).

The ability expansion path graphically depicts any systematic differences between higher and lower ability students with regard to class difficulty. For example, if higher and lower ability students take classes of the same difficulty level, the expansion path is vertical and GPAs are not biased (expansion path "a" in Figure 2.1). In contrast, if higher ability students systematically enroll in harder classes, differences in GPAs underrepresent the differences in ability and the expansion path is positively sloped (expansion path "b").
When high ability students enroll in classes that are even harder still, their increase in ability is entirely masked in GPAs and the expansion path is horizontal (expansion path "c").

I observe an estimated expansion path that is largely vertical, with some deviations from this at the top and bottom of the ability distribution. This is consistent with my general finding that GPAs are not very biased. At the top of the distribution, GPAs are positively correlated with average class difficulty. Specifically, the top students (those with GPA > 3.8) are systematically enrolled in harder classes. At the bottom of the distribution (students with GPAs < 2.5), GPAs are again positively correlated with average class difficulty when looking at students who are on a path to graduate, but are uncorrelated when including students who are dropouts. Thus, for graduates, the positive correlation at the bottom of the ability distribution is induced by the university's GPA minimum, which eliminates those lower-ability students who were enrolled in more difficult classes at entrance.

Knowing the role of class difficulty in grades is important because low GPAs can remove a student from eligibility for scholarships and even continued enrollment at a university; GPAs are also used for graduate admissions and potentially influence labor market outcomes (Loury & Garman, 1995; Jones & Jackson, 1990). While one might initially suspect that the importance of GPAs represents a misplaced faith in their value as a measure of ability, there is an irreducible reason to use this less than transparent measure of ability: there is no substitute quantitative measure of ability in college. Because of this, and despite the real possibility that grades may be biased, they are frequently used as outcomes (Angrist et al., 2009; Klopfenstein & Thomas, 2009; DeSimone, 2008; Betts & Morell, 1998) and as regressors (Loury & Garman, 1995; Jones & Jackson, 1990).[2] In primary and secondary schools, standardized tests are specifically designed to provide a comparison external to grades. However, the diversity of students' educational experiences at college prevents this type of comparison: what standardized test could capture the material learned by a student majoring in computer science and simultaneously capture the material learned by a student majoring in music? Therefore, despite their flaws, grades remain the best available summary measure of student ability. Understanding the factors that influence GPAs is critical to their effective use both in research and practical applications.

[Footnote 2] In some of these papers, dropout rate is another important measure of college performance, but it is only useful when looking at students close to the margin of dropout.

The following section of this paper describes challenges with using correlation coefficients for discrete data, such as student grades. Sections three and four describe the data and the results. A final section concludes.

2.2 Correlations

Grade performance is described by a student ability contribution (a) and a class difficulty contribution (d). Typically, when reporting regression results, the main results of interest involve a small number of estimated regression coefficients. In the case of a fixed effects model there are thousands of regression coefficients, and interpretation of all coefficients is neither possible nor desirable. The importance of the regression coefficients in determining grades is measured instead by the correlation between the fixed effects estimates and the grades.
The Pearson correlation coefficient is not an appropriate measure of the correlation between student performance (a), class grading difficulty (d), and grades (G); the Goodman-Kruskal gamma is appropriate instead, for reasons that are described in this section.

The Pearson correlation coefficient for two variables (A and B) has three salient properties that give it currency as a statistic: first, it is invariant to scale, so that Cor(A, cB) = Cor(A, B) for any positive scalar c; second, it is invariant to location, so that Cor(A + c, B + d) = Cor(A, B); third, it takes on values in [−1, 1], with zero indicating no covariance between A and B and larger absolute values indicating stronger covariance. In the extreme, a value of 1 or −1 indicates a linear relationship of the form

A = c_0 + c_1 B. (2.1)

For discrete data, which can be represented on a two-way table, the Pearson-type properties cannot be preserved. Other properties are more appropriate for discrete data. For grades, operations like multiplying by or adding an arbitrary constant to the data do not make sense,[3] so the first two properties of the Pearson correlation coefficient are not relevant. However, the concept that a correlation should be able to take on all values in [−1, 1] does remain sensible, and is maintained by some rank correlation coefficients. Additionally, a correlation coefficient of 1 is associated not with a linear relationship but instead with a weakly-monotonic relationship (Ghent, 1984).

[Footnote 3] Adding one to a 'B' might appear to make sense, but adding an arbitrary constant to a 'B' is more difficult to interpret.

Rank-type correlations can be represented as operating on a two-way table, so that there are cells potentially above, below, to the right, and to the left, as well as diagonals from any given cell (X):

              variable B
variable A    .    .    .
              .    X    .
              .    .    .

For data tabled this way, the Goodman-Kruskal statistic uses all possible pairs of data and defines concordance between a reference cell value and another value when, relative to the reference cell, it is above and to the right or down and to the left. These relationships suggest A and B are associated because larger values of A are associated with larger values of B. Discordance is the name given to cells above and to the left or below and to the right. These relationships suggest A and B covary negatively. The table below shows these two types of cells, labeled C and D for concordance and discordance with cell X, respectively, for grade data and SAT scores.

                     Grade
               F      B      A
SAT high       D      .      C
SAT mid        .      X      .
SAT low        C      .      D

The remaining cells are those in the same row or column as X; they are not included in the calculation of the Goodman-Kruskal statistic because they are uninformative. They are ties, where the data provide no information. Using these definitions, the Goodman-Kruskal correlation coefficient is

G = (C − D) / (C + D). (2.2)

In the case of two variables A and B being bivariate normally distributed, when the Pearson correlation coefficient is c, then A can be said to explain c² of the variation in B. This results from the maximum sum of squared Pearson correlations of a multivariate normal distribution being one. This means that if Cor(A, B) = c and Cor(B, C) = 0, then the largest possible value of Cor(A, C) is √(1 − c²). However, for rank correlation coefficients there is no such hard and fast rule, and thus the correlation coefficients cannot be interpreted with the same "percent explained" interpretation.
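The Goodman-Kruskal statistic in eq. 2.2 can be computed directly from a two-way contingency table by summing, for each cell, the concordant (below-right) and discordant (below-left) mass. The sketch below is my own illustration, not code from the dissertation; the function name and toy table are hypothetical.

```python
import numpy as np

def goodman_kruskal_gamma(table):
    """Goodman-Kruskal G = (C - D) / (C + D) from a two-way contingency table.

    table[i, j] counts observations with the row variable at ordered level i
    and the column variable at ordered level j; ties (cells sharing a row or
    column with the reference cell) are ignored, as in eq. 2.2.
    """
    table = np.asarray(table, dtype=float)
    n_rows, n_cols = table.shape
    concordant = 0.0
    discordant = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # cells strictly below and to the right agree in ordering (C)
            concordant += table[i, j] * table[i + 1:, j + 1:].sum()
            # cells strictly below and to the left disagree in ordering (D)
            discordant += table[i, j] * table[i + 1:, :j].sum()
    return (concordant - discordant) / (concordant + discordant)

# Toy example: rows are SAT bands (low to high), columns are grades (F to A).
toy = np.array([[25,  8,  1],
                [10, 30, 10],
                [ 1,  8, 25]])
print(goodman_kruskal_gamma(toy))  # near +1: higher SAT goes with higher grades
```

Each pair of observations is counted exactly once, because concordance is tallied only toward the lower right of each reference cell and discordance only toward the lower left.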
2.3 Data

The primary data used in this paper is transcript data from the University of Maryland from the years 2003 to 2010. Each individual observation consists of a class identifier, an anonymized student identifier, and the grade the student received in that class. In addition, students' application information is linked and used for baseline characteristics. These variables are summarized in Table 2.1.

The data were provided by the University's Institutional Research, Planning and Assessment group as two files, one with transcript entries and another with application information.[4] Because the data were used to produce transcripts and for applications, they arrived relatively "clean," with few missing values. The transcript file itself would be sufficient for constructing most of the variables necessary to run the regressions in this paper. However, some cleaning was necessary. More information on the data is given in the Data Appendix.

[Footnote 4] In compliance with our IRB application, student identification numbers were scrubbed from the file, but a new version that was not related to the actual student identification number was placed on the new file so that a transcript could be built. For the purpose of this paper, these new identification numbers serve all of the purposes of a student identification number.

Some of the variables included in the results for an individual student are derived from more than one transcript entry. Information was aggregated to the following levels:

Individual class level: this is the transcript (one student in a single class who receives a grade) and represents no aggregation; regressions were run at this level;

Semester level: the total number of courses completed to date and the number of credits enrolled were calculated by aggregating data to the semester level;

Student level: the GPA and demographic information from the application were calculated by aggregating each student's entire transcript.

Students who receive GPAs of exactly 4.0 or very low GPAs are excluded from analysis. Students receiving 4.0s represent about 100 total observations in the sample, and low GPAs are those where the student never earned more than a 'C'. This is done because the simulation results (section 4.1) show that the estimated statistics are unchanged or improve when boundary cases are removed. A few percent of the students meet these criteria, almost all of them because they never passed any classes. In addition, internships and classes in which fewer than five students enrolled over the seven years studied are not included; these classes are not shared experiences in the same way that most other classes are. This restriction represents a small number of total credits, but a substantial number of classes.[5]

[Footnote 5] When these classes are dropped, GPAs calculated based on the remaining classes need not meet the minimum for graduation, even for graduating students. Some of the figures include students who apparently received lower than 2.0 GPAs, but this is just for the non-excluded classes.

The remaining data are filtered to include only those students who were between ages 18 and 25 when they entered college, started in 2003 to 2005, were matriculated at some point,[6] and stayed enrolled for at least eight semesters or completed 120 credits (the minimum for graduation) by 2010. I call this the "degree" sample (n = 9,410) because these are the students who completed or are likely on the road to completing a degree at the University. The entry and exit profile for these students is shown in Table 2.2.

[Footnote 6] Matriculated is the status of a student who was enrolled as a degree-seeking student.
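To make the aggregation levels and sample exclusions concrete, here is a minimal pandas sketch. It is my own illustration under assumptions: the column names (student_id, class_id, term, grade_points, credits) are hypothetical, credit-weighted GPA is assumed, and term codes are assumed to sort chronologically.

```python
import pandas as pd

# One row per transcript entry: a grade for one student in one class.
transcript = pd.read_csv("transcript.csv")  # hypothetical file and columns

# Semester level: running totals of courses completed and credits enrolled.
by_term = (transcript
           .groupby(["student_id", "term"])
           .agg(n_courses=("class_id", "size"), credits=("credits", "sum"))
           .groupby(level="student_id").cumsum())  # assumes terms sort in time order

# Student level: GPA over the whole transcript (credit weighting assumed).
def gpa(g):
    return (g["grade_points"] * g["credits"]).sum() / g["credits"].sum()

student_gpa = transcript.groupby("student_id").apply(gpa).rename("GPA")

# Exclusions described in the text: drop exact-4.0 students and students who
# never earned better than a 'C' (2.0 grade points).
best_grade = transcript.groupby("student_id")["grade_points"].max()
keep = (student_gpa < 4.0) & (best_grade > 2.0)
analysis = transcript[transcript["student_id"].isin(keep[keep].index)]
```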
For robustness checks, two expanded samples are considered. One is all students who entered between 2003 and 2005, as long as they completed at least five classes since enrolling and were matriculated at some point. This sample is called the "enter" sample (n = 18,400) because this is the entering cohort of the "degree" sample. The robustness of the results to the inclusion of students that eventually left the program is tested using this sample. The entry and exit profile for students in the "enter" sample is shown in Table 2.3.

The final expanded sample includes all students who have at least five transcript entries, regardless of when (within 2003-2010) they entered the university or whether they matriculated. This is called the "full" sample because it contains almost all of the students at the university during this time range.

The "degree" and "enter" samples are similar in their covariates. Of those who took the SAT, the "degree" sample's average was higher by only seven points on the verbal and seven points on the math test. This represents an approximately two percentile difference in test scores. Relative to the "degree" sample, the "enter" sample had a lower average GPA by 0.14, five percentage points fewer took the SAT test, and twelve percentage points more transferred. By definition, every person in both the "degree" and "enter" samples was a matriculated student.

The "full" sample includes students who did not have sufficient time to complete a degree because they enrolled in the last five years. The only qualification to be in the sample is to have entered the University of Maryland, but the time range is longer and the potential tenure shorter. Because of that, these students more closely resemble the "enter" sample but have completed fewer classes and credits.

Histograms show the distribution of average grade points for departments, students, and classes. For each of these,

GP̄ = (1/n) Σ_i GP_i,

where GP_i is the grade points for grade i associated with the unit (department/student/class), grade points are taken from the typical GPA calculation ('A' = 4.0, 'B' = 3.0, etc.), and n is the total number of observations associated with the unit.

Previous studies have found substantial variability in average grade by department (Sabot & Wakeman-Linn, 1991; Freeman, 1999; Eaton & Eswaran, 2008); this is also the case with this sample (Figure 2.2). In addition, student and class average grades show substantial and similar variation (Figures 2.3 and 2.4, respectively). Note that the peak is nearly flat over an entire grade point.

2.4 Results

The decomposition of grades into a set of fixed effects for students and another set for classes is first estimated with the OME, with no additional regressors, on the "degree" sample using the equation

y*_{ij} = a_i − d_j + ε_{ij}, (2.3)

where i indexes students, j indexes classes, and each class in which a student receives a grade is an observation. Robustness of the results is then tested to the addition of time-dependent student-specific regressors, the estimator used (ordered logit, probit, and OLS), and the selection criteria for the sample.
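Before turning to the fitted values, a sketch may help fix ideas about how eq. 2.3 can be estimated. The code below is a minimal illustration of an ordered-logit fixed effects fit using the limited-memory L-BFGS method discussed in chapters 1 and 4; it is not the dissertation's implementation. It assumes integer-coded inputs (grades y in 0..K−1, student indices s, class indices c), pins the lowest cut point at zero (as in Table 2.5), and, to stay short, omits the analytic gradient that a problem with thousands of fixed effects would need in practice.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic CDF

def neg_loglik(theta, y, s, c, n_students, n_classes, K):
    a = theta[:n_students]                        # student ability a_i
    d = theta[n_students:n_students + n_classes]  # class difficulty d_j
    # K-1 increasing cut points; the first is pinned at 0 and the K-2 gaps
    # are kept positive via exp()
    cuts = np.concatenate([[0.0], np.cumsum(np.exp(theta[-(K - 2):]))])
    xb = a[s] - d[c]                              # latent index of eq. 2.3
    upper = np.where(y == K - 1, np.inf, cuts[np.minimum(y, K - 2)])
    lower = np.where(y == 0, -np.inf, cuts[np.maximum(y - 1, 0)])
    p = expit(upper - xb) - expit(lower - xb)     # P(grade = y | a, d, cuts)
    return -np.log(np.clip(p, 1e-300, None)).sum()

def fit_ome(y, s, c, n_students, n_classes, K):
    # Note: a_i and d_j are only identified up to a common constant here;
    # in practice one would also pin, e.g., d_0 = 0.
    theta0 = np.zeros(n_students + n_classes + K - 2)
    res = minimize(neg_loglik, theta0,
                   args=(y, s, c, n_students, n_classes, K),
                   method="L-BFGS-B")  # storage grows linearly in #parameters
    return res.x[:n_students], res.x[n_students:n_students + n_classes]
```

At real scale (roughly 18,000 fixed effects and 500,000 observations), the finite-difference gradients scipy falls back on would be far too slow; an analytic gradient, which is cheap for this likelihood, would be supplied via the `jac` argument.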
2.4.1 Decomposing grades using the ordered multinomial estimator

Using the fitted values from the main regression equation (eq. 2.3), correlations are then calculated between the fitted ability (a_i), GPA, and a student's average class difficulty (d̄_i), where a student's average class difficulty is simply the average of the class difficulties of the classes in which they are enrolled. This is given by

d̄_i = (1/n_i) Σ_{j ∈ {i's classes}} d_j, (2.4)

where the set {i's classes} is the classes in which student i is enrolled, and n_i is the number of classes in which the i-th student is enrolled.

GPAs are only slightly biased. The correlation between student ability and average class difficulty is only 0.16. This is a small correlation; student ability and average class difficulty are only slightly related. GPA and average class difficulty are also not strongly related, with a correlation coefficient of 0.19; squaring the correlation coefficient gives 0.036, so only 3.6% of the variation in GPA is explained by class difficulty. Higher ability students are taking only slightly harder classes. The main driver of GPAs is ability, with a correlation coefficient of 0.90, meaning that about 80% of the variation in GPA is explained by student ability alone (Table 2.4).

The ordered multinomial estimator generates "intercepts" for each grade boundary. These are useful for interpreting the magnitude of the effects shown above, which are on a logistic scale. These values show that the "width" of a grade increases with the grade, so that 'B' is 2.47 units wide (Table 2.5, far right column) and encompasses a larger range of observed ability than 'C' (1.65 units wide) or 'D' (0.59 units wide). These widths can be used to judge the magnitude of values in these units. For example, a 'B' grade is about 2.47 units wide, so a student whose ability places them at the lower edge of 'B's (receiving about half grades of 'B' or higher and half grades of 'C' and lower) has 2.47 lower ability, ceteris paribus, than a student whose ability places them at the upper edge of 'B's (receiving about half 'A's and half grades of 'B' and lower). Because grades have a maximum and minimum value, the GPAs of these students would be expected to be less than 1.0 apart.

2.4.2 The ability expansion path

The correlation coefficients between student ability, GPA, and average class difficulty are parametric tests and assume that the two variables being correlated are bivariate normally distributed. The ability expansion path makes no such assumption and shows the relationship between all three of these variables most clearly (Figure 2.5). This plot shows that for the vast majority of students, with GPAs between about 2.5 and 3.8, the ability expansion path is essentially vertical and there is no GPA bias. However, for the highest ability students (those with GPA > 3.9), average class difficulty is substantially higher, and for students with GPAs below about 2.5, GPA and average class difficulty are positively associated. This suggests that what little bias there is in GPAs is focused in these two groups.

To graphically demonstrate the tradeoff between grades and difficulty level, each student's individual estimated average difficulty and observed GPA are plotted in 20 color bands, where each band represents students with approximately similar fitted values of a_i (Figure 2.6). This shows that students with a common level of ability (sharing a single color) are, in fact, trading off between difficult classes and higher grades.
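The fitted values from the sketch above could be turned into eq. 2.4 and a plotted expansion path along the following lines; again, this is my illustration of the construction described in the text rather than the dissertation's code.

```python
import numpy as np
import matplotlib.pyplot as plt

def average_class_difficulty(d_hat, s, c, n_students):
    """Eq. 2.4: mean fitted difficulty of the classes each student took."""
    sums = np.bincount(s, weights=d_hat[c], minlength=n_students)
    counts = np.bincount(s, minlength=n_students)
    return sums / counts

def plot_expansion_path(a_hat, d_bar, gpa, n_groups=20):
    # pool students into ability bands, then average GPA and difficulty per band
    edges = np.quantile(a_hat, np.linspace(0, 1, n_groups + 1)[1:-1])
    band = np.digitize(a_hat, edges)
    xs = [d_bar[band == b].mean() for b in range(n_groups)]
    ys = [gpa[band == b].mean() for b in range(n_groups)]
    plt.plot(xs, ys, marker="o")
    plt.xlabel("average class difficulty")
    plt.ylabel("GPA")
    plt.show()
```

A vertical line of points corresponds to expansion path "a" in Figure 2.1 (no GPA bias); a positive slope corresponds to path "b".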
Having described the properties of the fitted a_i and d_j regressors, it is important to verify that the estimated coefficients are statistically significant. The LR test for the fixed effects in eq. 2.3 shows that the addition of class and student fixed effects, in any possible order, is highly significant (Table 2.6).

2.4.3 Variation in class difficulty

Earlier in this chapter I mentioned that there are three conditions for grades to be biased: (1) there must be variation in class difficulty, (2) there must be variation in average class difficulty, and (3) the variation in average class difficulty must be correlated with student ability. The results so far raise the question of which of the three conditions for bias in grades actually hold, since grades are only very slightly biased. A skeptic might wonder if the graph of average grade points by department was misleading and was actually the result of random variation and not fundamental dispersion in grades. But this is not the case. The graph gives an accurate impression, because only the third condition for bias is not present; student ability and average class difficulty are not correlated. This occurs even though there is substantial variation in class difficulty and in average class difficulty.

The first condition holds. There is substantial variation in difficulty across classes. The standard deviation of class fixed effects (d_j) is 1.79 (Table 2.7, column A), or about 3/4 of the width of the 'B' range. The standard deviation of student fixed effects is approximately equal, at 1.76. Both of these ways of interpreting the variation in class difficulty show that the variation in class difficulty is large.

Another way to look at the variation in class difficulty is to calculate raw correlations between class difficulty, student ability, and grades awarded in each class. The Goodman-Kruskal correlation coefficient between grades and student fixed effect (G_{g,a}) is 0.43, while the correlation coefficient between grades and class fixed effect (G_{g,d}) is 0.34 (Table 2.8).[7] These two coefficients are of approximately the same magnitude, meaning that the ability of the student taking a class and the difficulty of the class are approximately equally important in predicting the awarded grade. The substantial variation in class fixed effects and the relative size of the Goodman-Kruskal correlation coefficient between classes and grades show that classes vary in difficulty about as much as, or slightly less than, students vary in ability.

[Footnote 7] The remaining correlation coefficient, G_{a,d} = 0.06, simply mirrors the low Pearson correlation coefficient between average class difficulty and grade: ability and class difficulty are not strongly related.

Results from these ways of viewing the difficulty show that the first condition is readily met; students have a very wide selection of class difficulty levels to choose from, and do so.

The second condition also holds. There is substantial variation in students' average class difficulty (d̄_i). The observed standard deviation in average class difficulty (d̄_i) is 0.66. Because it is an average, the standard deviation of d̄_i has to be smaller than the standard deviation of d_j, but how much smaller is an empirical question.[8] One way to put this value into context is to compare it to what would happen if the values of d_j were random draws from the available classes. If this were the case, then the central limit theorem says that, given that each student in the sample takes about n_i ≈ 40 classes (Table 2.1), the standard deviation of average class difficulty (d̄_i) should be 1.79/√40 ≈ 0.28. Thus 0.28 can be used as a yardstick for measuring the observed variation in d̄_i.

[Footnote 8] This result follows from the mild assumption of finite support for d_j.
Values higher or lower than 0.28 indicate that something is raising or lowering the variation in average class difficulty relative to random assignment to classes. The observed standard deviation of 0.66 is over twice the random assignment value of 0.28: there is substantial variation in average class difficulty.

Putting these two together, only the final condition for bias in grades does not hold: ability and average class difficulty are not strongly correlated with each other. In other words, essentially all students are enrolled in an equivalent mix of hard and easy classes.

2.4.4 Robustness to other regressors

The results so far are based on the decomposition of grades into a student and a class fixed effect (eq. 2.3). But other things also change within students' careers at a university. I conduct a robustness check of these results with additional regressors of the form

y*_{ijt} = a_i − d_j + x_{it}β + ε_{ijt}, (2.5)

where x_{it} are student- and semester-specific regressors and β are the associated regression coefficients.

Previous studies have observed grade inflation, so year fixed effects are added (Sabot & Wakeman-Linn, 1991). This model also includes semester-of-the-year (Spring, Summer I, Summer II, Fall, and Winter) fixed effects to account for possible time-of-year variation in grading. These are statistically significant (Table 2.7, column B) and trend in the right direction (Table 2.9, column B). Fall and Spring are 15-week semesters, and the other three terms are shorter terms where students often focus on a single class. Grades are substantially higher in the shorter terms; this could be because of changes in faculty posture towards grades or in students' concentration or interest level when taking one class at a time. Despite the improvement in the fit when these regressors were added, the main results do not change (Table 2.7).

Another possible issue is that student fixed effects (a_i) might not be so fixed and might change through time.[9] Adding fixed effects for class year is statistically significant (Table 2.7, column C) and does not change the correlation coefficients appreciably. However, despite the theoretical reasons to believe that there should be a positive trend in class year, there is a clear negative trend (Table 2.9).

One might suspect that students taking more or fewer classes could do better or worse, so registered (graded) credits and total (graded + ungraded) credits regressors are added (Table 2.7, column D). The addition is statistically significant. From this table it is apparent that the correlations of interest are unaffected by this addition. The coefficients are very small; adding a typical graded class with three credits (registered and total) decreases the estimated value of the (latent) prediction by 0.0097, about a third of a percent of the distance between the 'C+'/'B-' cutoff and the 'B+'/'A-' cutoff. This could be because students are at the optimum number of classes for their allocation of time to studying college material, or because additional classes do not tend to have negative overflows, perhaps because time spent studying in college is very low, so that students' effort is not constrained by time (Babcock & Marks, 2010).

The ability expansion path is very robust to the inclusion of these controls, with almost no change as they are added (Figure 2.7). Finally, the intercepts for the grade boundaries are very stable across specifications (Table 2.5).
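As an illustration of how the eq. 2.5 regressors might be assembled, the fragment below extends the earlier sketches; the column names (year, term, class_year, reg_credits, tot_credits) are hypothetical.

```python
import pandas as pd

# One dummy column per non-omitted level, mirroring the layout of Table 2.9.
X = pd.get_dummies(transcript[["year", "term", "class_year"]].astype(str),
                   drop_first=True)
X[["reg_credits", "tot_credits"]] = transcript[["reg_credits", "tot_credits"]]
X = X.to_numpy(dtype=float)

# In the ordered-logit sketch above, eq. 2.3 becomes eq. 2.5 by appending a
# coefficient vector beta to theta and widening the latent index:
#   xb = a[s] - d[c] + X @ beta
```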
[Footnote 9] For example, students might be accumulating human capital.

2.4.5 Robustness to estimator

The OME describes many possible estimators. Ordered logit is the name for the OME when the error term is drawn from the extreme value distribution, while ordered probit is the name for the OME when the error term is normally distributed. Which one to use is sometimes resolved a priori based on the data generating process. Another way to decide which estimator to use is to treat it as an empirical question by adding a parameter that adjusts between zero and one as the error term moves from extreme value distributed to normally distributed (McCullagh & Nelder, 1989). An LR test on this parameter tests whether logit is better than probit. For this sample, the logit is a substantially better fit than the probit, with a chi-squared of 240 on 1 degree of freedom. The associated p-value is essentially zero. Because of this, I use the ordered logit exclusively when estimating the OME.[10]

[Footnote 10] Other alternatives to the ordered probit and ordered logit that were tested include the asymmetric error terms from the log-log and complementary log-log link functions described in McCullagh & Nelder (1989). Both were extremely poor fits for these data.

An alternative to the OME estimator is the OLS estimator, which assumes that the residuals in the fixed effects estimate are normally distributed. When using the OLS estimator, the main specification changes from eq. 2.3 to

GP_{ij} = a_i − d_j + ε_{ij},

where GP_{ij} is the grade points awarded to student i in class j.

The stylized facts of the results do not change with the OLS estimator. In the OLS fit, the standard deviation of the class fixed effects is 0.48 and the standard deviation of the student fixed effects is 0.62. Similar to the OME estimate, these two standard deviations are of about the same magnitude. Qualitatively, the regression coefficients in the models with additional controls are similar (not shown). The correlations between student fixed effect, class fixed effect, and grade are also qualitatively similar (Table 2.10, first column).

The ability expansion path for the OLS results has a different interpretation than in the OME results, because the x-axis is a latent variable in the OME and it is not when using OLS. However, the resulting curves are remarkably similar nonetheless (Figure 2.8). The principal difference is at the top of the grade distribution, where the simulation in chapter 4 shows that OLS performs poorly.
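For the linear specification, the two sets of fixed effects can be fit with sparse least squares, which keeps the dummy matrix tractable at half a million rows. A minimal sketch, my own and not the dissertation's implementation:

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack
from scipy.sparse.linalg import lsqr

def ols_two_way(gp, s, c, n_students, n_classes):
    """Least-squares fit of GP_ij = a_i - d_j + e_ij via sparse dummies."""
    n = len(gp)
    rows = np.arange(n)
    S = coo_matrix((np.ones(n), (rows, s)), shape=(n, n_students))
    # minus sign on the class dummies encodes the a_i - d_j parameterization
    D = coo_matrix((-np.ones(n), (rows, c)), shape=(n, n_classes))
    Z = hstack([S, D]).tocsr()
    theta = lsqr(Z, gp)[0]  # a least-squares solution; the common location of
                            # a and d is pinned down only by the solver's choice
    return theta[:n_students], theta[n_students:]
```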
2.4.6 Robustness to sample selection criteria

Looking across samples, the results are qualitatively similar. The correlation coefficients for these three samples are shown in Table 2.10. These results suggest that the sample selection criteria do not change the conclusions.

The ability expansion path for these groups is qualitatively similar at every level except at the bottom (Figure 2.9). This suggests that, for students in the "degree" sample, the lower tail's leftward shift is the result of removal of lower-ability students who were enrolled in more difficult classes at entrance.

2.5 Conclusion

GPAs and student ability are strongly correlated, and the GPA is not strongly biased. A GPA is therefore a reasonable proxy for student ability. An interesting result of this chapter is that although a student's GPA strongly reflects their ability, the same cannot be said of any particular grade a single student receives in an individual class. In fact, student ability and class difficulty contribute equally to any individual grade. However, because students of varying ability are enrolled in very similar mixes of easy and difficult classes, the influence of class difficulty on individual grades disappears in the overall GPA.

A necessary condition for GPAs to be biased is that individual classes must vary in difficulty. This condition holds: the variation in individual class difficulty (class fixed effects) is so large that ability (student fixed effects) and class difficulty are approximately equally important in determining any individual grade. Individual grades, even conditional on the class, are dominated by noise. Unconditionally, individual grades are not comparable; ability is not strongly predictive of the grade. Thus there is a strong distinction between the influence a student has on their overall GPA and their grade in each individual class.

Bias in GPA also requires that within-student average class difficulty varies. This condition also holds: variation in average class difficulty by student is substantial.

Finally, bias in GPA requires that the variation in average class difficulty is associated with student ability. This condition does not hold: average class difficulty and student ability are only slightly correlated.

Therefore, GPAs are largely not biased at the University of Maryland. Student ability predicts 80% of the variation in GPAs and the empirical ability expansion path is nearly vertical. The shape of the ability expansion path also informs the mechanism by which ability increases human capital production: students with higher ability are taking classes of the same difficulty level as the lower ability students.

There are two exceptions to the vertical expansion path: students at the top and at the bottom of the ability distribution, where ability and difficulty are positively associated. Students at the top of the ability distribution appear to enroll in harder classes and get mostly 'A's in these classes. Because of this, the difference between a student who earns a 3.9 GPA and one who earns a 3.91 is much larger than the difference between a student who earns a GPA of 3.0 and one who earns a 3.01.

At the bottom of the ability distribution, students who are continuously enrolled take easier classes. This appears to be a mechanical effect of a minimum GPA for graduation.

When using grades as a measure of ability, especially for students who are in the middle of the GPA distribution, GPAs are a good predictor of ability. Average class difficulty (the variation in class difficulty by student) explains little (about 3%) of the variation in GPAs. There is also 17% residual variation, so one cannot rule out that some other confounding factor could affect grades. When studying students that are near the 2.0 GPA cutoff, the difficulty of the classes that they take is important, and an indicator for dropping out could be a better measure of outcomes than GPA, which is kept near the minimum grade by university policy.

Because these results are from only one university, they might be viewed as "local" to one portion of the ability distribution: that observed at a top state university. But two facts suggest that the results are relatively broad: consistent with universities studied in other research, there is substantial variation in average grade by department, and the range of SAT scores observed is quite wide. Nevertheless, reproducing the main result of this paper, that GPAs are a good measure of student ability, at other universities would be valuable.
Table 2.1: Descriptive statistics of student transcript data

                  Degree         Enter          Full
GPA               2.94 (0.599)   2.80 (0.747)   2.82 (0.717)
SAT verbal        599 (85.5)     592 (88.7)     594 (89.2)
SAT math          632 (85.7)     625 (87.7)     623 (87.4)
took SAT          0.683          0.635          0.475
HS GPA            3.80 (0.493)   3.75 (0.508)   3.75 (0.491)
has HS GPA        0.881          0.790          0.785
trans. GPA        3.13 (0.464)   3.14 (0.461)   3.15 (0.458)
has trans. GPA    0.202          0.325          0.300
female            0.485          0.494          0.483
white             0.569          0.565          0.575
age at entry      18.4 (1.31)    18.8 (1.64)    19.1 (1.72)
n terms           9.94 (2.11)    8.25 (3.14)    6.23 (3.19)
n classes         41.6 (7.91)    34.7 (12.7)    26.2 (12.9)
total credits     121 (20.1)     101 (35.6)     76.4 (37.0)
degree seeking    1.000          1.000          0.992
n                 12,995         19,489         63,875

Note: Standard deviations are in parentheses for all non-binomial variables. The columns are the "Degree", "Enter", and "Full" samples (see text).

Table 2.2: Entry/exit for "Degree" data

Exit year   2003    2004    2005
n           3,909   4,528   4,558
2004        1       -       -
2005        118     9       -
2006        476     218     14
2007        2,738   740     212
2008        410     2,916   800
2009        116     497     3,007
2010        50      148     525

Note: Each entry shows the number of students in the "Degree" data (see text) who entered in 2003-2005 (columns) and exited in 2004-2010 (rows). The total number of entrants for each year is listed in the first row.

Table 2.3: Entry/exit for "Enter" data

Exit year   2003    2004    2005
n           5,709   6,819   6,961
2003        60      -       -
2004        512     183     -
2005        741     626     165
2006        913     1,071   650
2007        2,881   1,174   1,095
2008        417     3,080   1,282
2009        123     520     3,215
2010        62      165     554

Note: Each entry shows the number of students in the "Enter" data (see text) who entered in 2003-2005 (columns) and exited in 2003-2010 (rows). The total number of entrants for each year is listed in the first row.

Table 2.4: Main results

         GPA_i   d̄_i    a_i
GPA_i    -       0.19   0.90
d̄_i      0.19    -      0.16
a_i      0.90    0.16   -

Note: Pearson correlation coefficients (ρ) between column and row quantities. The quantity d̄_i is the average class difficulty (d_j) aggregated to the student level.

Table 2.5: Grade boundary cut points from regressions

Boundary   Model A   Model B   Model C   Model D
F/D        0.00      0.00      0.00      0.00
D/C-       0.59      0.59      0.59      0.59
C-/C       0.94      0.95      0.95      0.95
C/C+       1.94      1.95      1.95      1.94
C+/B-      2.25      2.27      2.27      2.27
B-/B       2.79      2.81      2.81      2.81
B/B+       4.13      4.16      4.16      4.16
B+/A-      4.73      4.76      4.76      4.76
A-/A       5.69      5.73      5.73      5.73

Grade widths (model D): D 0.59; C- 0.36, C 0.99, C+ 0.33 (the three C grades together span 1.65); B- 0.54, B 1.35, B+ 0.60 (the three B grades together span 2.47); A- 0.97.

Note: Grade widths are based on model D. Because the 'F' and 'A' ranges are unbounded, the total width of the bins is always infinite.

Table 2.6: Likelihood ratio tests for the addition of class and student fixed effects

Small       Large             k        LR test statistic   Jackknife LR
intercept   class             4,809    193,578             342,195
intercept   student           12,994   143,692             262,861
intercept   class + student   17,803   369,029             661,791
class       class + student   12,994   225,336             398,929
student     class + student   4,809    175,450             319,596

Note: All p-values are indistinguishable from zero. See chapter 4 for a description of the jackknife method applied.

Table 2.7: Standard deviations, correlation coefficients, and significance tests for various specifications, estimated with the OME

Model                                     A         B         C         D
student + class                           X         X         X         X
year + semester                           -         X         X         X
class year                                -         -         X         X
semester credits (registered and total)   -         -         -         X
σ_a                                       1.60      1.63      1.65      1.66
σ_d                                       1.70      1.68      1.61      1.62
σ_d̄i                                      0.63      0.63      0.62      0.62
γG(d, g)                                  0.33      0.33      0.35      0.33
γG(a, g)                                  0.41      0.44      0.39      0.42
γG(a, d)                                  0.07      0.06      0.05      0.05
ρ(a_i, GPA_i)                             0.902     0.905     0.906     0.907
ρ(d̄_i, GPA_i)                             0.161     0.165     0.164     0.164
ρ(a_i, d̄_i)                               0.185     0.176     0.176     0.175
additional controls                       -         11        3         2
LR test stat.                             -         5,461     329       75
χ² p-value                                -         < 10⁻⁴    < 10⁻⁴    < 10⁻⁴
observations                              523,151   523,151   523,151   523,151
students                                  12,995    12,995    12,995    12,995

Note: The top block shows the regressors included in each column. The second block shows correlation coefficients at the class level (measured with the Goodman-Kruskal correlation coefficient γG) and at the transcript level (measured with ρ); subscripts indicate ability (a), class difficulty (d), average class difficulty (d̄_i), and grade (g). The final block shows likelihood ratio tests for the inclusion of the additional regressors.

Table 2.8: Goodman-Kruskal correlation coefficients

        g_ij    a_i     d_j
g_ij    -       0.43    0.34
a_i     0.43    -       0.06
d_j     0.34    0.06    -

Note: Each Goodman-Kruskal correlation coefficient (γG) is between the column and row quantities: grades (g_ij), student fixed effect (a_i), and class difficulty (d_j).

Table 2.9: Ancillary regression coefficients

Model                 A    B      C      D
year = 2003           -    0.15   0.63   0.60
year = 2004           -    0.23   0.65   0.62
year = 2005           -    0.20   0.56   0.54
year = 2006           -    0.15   0.42   0.41
year = 2007           -    0.16   0.32   0.31
year = 2008           -    0.17   0.24   0.24
year = 2009           -    0.05   0.08   0.08
year = 2010           -    -      -      -
term = Spring         -    0.04   0.08   0.07
term = Summer I       -    0.86   0.84   0.84
term = Summer II      -    0.84   0.84   0.83
term = Fall           -    -      -      -
term = Winter         -    1.01   1.05   1.04
freshman              -    -      -      -
sophomore             -    -      0.08   0.06
junior                -    -      0.20   0.19
senior                -    -      0.36   0.34
registered credits    -    -      -      5.9×10⁻⁴
total credits         -    -      -      38.2×10⁻⁴

Note: A dash indicates that a regressor is the omitted level or that the whole set was not included in the specification.

Table 2.10: Robustness of results to sample selection

                   "Degree"    "Enter"    "Full"
OME
σ_a                1.76        2.41       2.95
σ_d                1.80        1.67       1.61
σ_d̄i               0.66        0.66       0.71
γG(a, g)           0.43        0.45       0.47
γG(d, g)           0.33        0.32       0.30
γG(a, d)           0.05        0.04       0.05
ρ(a_i, GPA_i)      0.86        0.87       0.79
ρ(d̄_i, GPA_i)      0.19        0.23       0.20
ρ(a_i, d̄_i)        0.10        0.02       0.04
OLS
σ_a                1.61        0.81       0.79
σ_d                0.46        0.46       0.45
σ_d̄i               0.21        0.22       0.24
γG(a, g)           0.43        0.45       0.45
γG(d, g)           0.31        0.31       0.30
γG(a, d)           0.06        0.06       0.04
ρ(a_i, GPA_i)      0.94        0.96       0.95
ρ(d̄_i, GPA_i)      0.09        0.18       0.16
ρ(a_i, d̄_i)        0.24        0.08       0.11
observations       537,606     677,634    1,699,053

Note: This table shows the standard deviations of student ability (σ_a), class difficulty (σ_d), and average class difficulty (σ_d̄i); Goodman-Kruskal (γG) correlations at the transcript-entry level between student ability (a), class difficulty (d), and grade (g); and Pearson correlations between GPA, ability, and the average of all class difficulties (d̄_i), for the OME and OLS estimators.

Figure 2.1: Ability expansion path. [Schematic plot: class difficulty on the x-axis, GPA on the y-axis, with paths drawn from low ability (student FE) to high ability (student FE).] Note: see text.

Figure 2.2: Histogram of average grade awarded by departments. [Histogram: average grade awarded (2.0-4.0) on the x-axis, percentage (0%-30%) on the y-axis.] Note: Each department is counted once; the histogram is not weighted by enrollment. Only departments with an average grade awarded of 2.0 and higher are shown. This figure uses the "Degree" sample.

Figure 2.3: Histogram of student GPAs. [Histogram: student GPA (2.0-4.0) on the x-axis, percentage (0%-15%) on the y-axis.] Note: The data used for this histogram are the "Degree" sample; only students with GPAs of 2.0 and higher are shown.

Figure 2.4: Histogram of average grade awarded by class. [Histogram: average grade awarded (2.0-4.0) on the x-axis, percentage (0%-15%) on the y-axis.] Note: The data used for this histogram are the "Degree" sample; only classes with an average grade awarded of 2.0 and higher are shown.

Figure 2.5: Observed ability expansion path.
[Scatter plot: average class difficulty on the x-axis, GPA (2.0-3.5) on the y-axis.] Note: Each point is a single vigentile (twentieth), showing the average class difficulty (as a deviation from the grand mean) on the x-axis and the average GPA on the y-axis.

Figure 2.6: Raw GPA versus average class difficulty. [Scatter plot: average class difficulty on the x-axis, GPA (0.5-3.5) on the y-axis.] Note: Individual students' GPAs versus average class difficulty, with color used to indicate student ability (student fixed effect) by vigentile (twentieth).

Figure 2.7: Ability expansion path with various additional regressors. [Scatter plot: average class difficulty on the x-axis, GPA on the y-axis, one series per model.] Note: Each point is a single vigentile (twentieth), showing the average class difficulty (as a deviation from the grand mean) on the x-axis and the average GPA on the y-axis. Model A is shown in black, model B in red, model C in green, and model D in blue.

Figure 2.8: Ability expansion path from the OME and OLS estimators. [Scatter plot: average class difficulty on the x-axis, GPA on the y-axis, one series per estimator.] Note: Using two estimators: OLS (red) and OME (black, also called logit). Each point is a single vigentile (twentieth), showing the average class difficulty (as a deviation from the grand mean) on the x-axis and the average GPA on the y-axis.

Figure 2.9: Ability expansion path from the three datasets. [Scatter plot: average class difficulty on the x-axis, GPA on the y-axis, one series per dataset.] Note: Using three datasets: "degree" (black), "enter" (orange), and "full" (green). Each point is a single vigentile (twentieth), showing the average class difficulty (as a deviation from the grand mean) on the x-axis and the average GPA on the y-axis.

Chapter 3: Are Low Income Students Diamonds in the Rough?

3.1 Introduction

Consider two students who receive the same SAT score, one from a lower income family and the other from a higher income family. If investment in schooling is a normal good, then the student from the higher income family is the product of a higher-investment education than the student from the lower income family.1 Because of this, one might suspect that the individual from the lower income family has higher "innate ability," while the individual from the higher income family has been the beneficiary of a higher-investment, higher-output educational environment.2 When these two students enroll at the same university and the difference in their families' investment inputs into education is reduced or eliminated, who will perform better?

This question is important for college admission, where a debate continues3 as to whether low income students with lower observable ability at the time of application should nonetheless be granted entrance because their past has included fewer opportunities and "held them back." In this view, students from low income families are diamonds in the rough.

1 Note that the investment could be pecuniary or time. The mechanism is not important to, or investigated in, this chapter.
2 Here family income is measured using the median income by zip code.
3 See, for example, popular press pieces such as Leonhardt (2011) and the response of Mankiw (2011). Others have focused on simply measuring collegiate performance as a function of prior inputs, with many authors focusing on the predictive power of standardized tests as an end in itself (Betts & Morell, 1998; Cohn et al., 2004; Grove & Wasserman, 2003; Bettinger et al., 2011; Geiser & Studley, 2001; Rothstein, 2004).

Because I am using administrative data from one school (the University of Maryland), the students in the sample are not a random sample of college students. The college admissions process has multiple steps that make each university's students a mutually selected group, and using data from a single school thus turns out to be problematic for this research design. A model that focuses on family income and SAT score (as a measure of student ability) at one university is not sufficient to explore the "diamonds in the rough" hypothesis. In addition, at the University of Maryland, high income students perform slightly worse than their lower income counterparts unconditionally. When the same regression is conditioned on SAT score, low income students continue to outperform their higher income counterparts, but by a smaller margin.

According to the traditional model of human capital development, at any given time an individual has a particular level of human capital that determines how quickly he or she can accumulate new human capital. In the canonical example, Ben-Porath describes human capital accumulation as a product of innate ability and current human capital (Ben-Porath, 1967), but ability and accumulated human capital are essentially identical there, since they always appear multiplied by one another. In stylized form, production functions based on the Ben-Porath model can be written

dHC_t/dt = f(HC_t, I_t),   (3.1)

where HC_t is the accumulated human capital of an individual at time t, I_t is the inputs applied to accumulating additional human capital, dHC_t/dt is the accumulation rate of human capital, and f(·,·) is a function with positive partial derivatives everywhere in both arguments. Here the concept of ability manifests as an initial level of human capital and is otherwise not present. In this type of model, a measure of aptitude/ability is a sufficient statistic for future performance: two students with similar test scores are expected to be equally productive, regardless of whether one was initially (at birth) high ability but saw low productivity increases before the test, while the other was initially low ability but saw large productivity increases before the test.

Education research uses this type of model for value added modeling, with human capital accumulation functions of the form (Hanushek, 2006)

E(HC_t − HC_{t−1}) = f(I_t),   (3.2)

where E(·) is the expected value operator, HC_{t−1} is a pre-test score taken before the time period of interest, HC_t is a post-test score taken after the time period of interest, and I_t is the inputs of interest during that time period.4 The underlying assumption is that, by subtracting a measure of human capital from the previous time period, all inputs to human capital that occurred before the time of interest are captured in the measure of human capital taken in the pre-period, t−1.

4 This is a stylized version of the model presented by Hanushek.

An alternative to the assumptions of these models is that ability is always important to production: regardless of the current level of accumulated human capital, there is an innate ability to learn that varies among individuals.
In this view, a test of aptitude will not capture the ability to learn and so is inadequate for predicting output once at college. Human capital accumulates according to

dHC_t/dt = f(HC_t, I_t, α_0),   (3.3)

where α_0 is innate ability.5,6 In this model, greater innate ability can compensate for lower prior investment. Students with lower pre-collegiate educational investments who are able to acquire the same level of human capital as students with higher pre-collegiate educational investments achieved this by virtue of greater innate ability. Once enrolled at the same university, these lower-investment (i.e., lower income) students will outperform their higher-investment (i.e., higher income) peers, conditional on having the same level of human capital upon entrance.

5 A subscript zero is used on alpha to distinguish it from the later use of α as a constant in regressions.
6 The Ben-Porath model fits into this specification, but it is multiplicatively not identified in the first and third arguments, so that f(λ·HC_t, I_t, α_0) = f(HC_t, I_t, λ·α_0).

Finally, one might suspect that non-cognitive skills play a role in human capital formation in a way that is not completely captured by test scores. It would also make sense that higher income families imbue their children not only with higher cognitive skills but also with higher non-cognitive skills, so that family income will be positively correlated with college output even when conditioning on test scores.

Because the sampled population is college students at the University of Maryland, students must have been admitted to the University and then have chosen the University of Maryland for college.7 To understand the role of the selection process, it is important to know that the main specification is a regression of the form

[student output] = α + β_1 [SAT] + β_2 [family income] + ε.   (3.4)

I make the assumptions that the SAT is an ability measure and that the SAT score is the only measure of pre-collegiate human capital observable to the admissions office. Under these strong assumptions, β_2 is interpretable as an effect of ability itself. The source of these assumptions is a null hypothesis that the model described by Hanushek (eq. 3.2) is correct, with the SAT treated as a pre-test (Hanushek, 2006).

7 Dale & Krueger (2002) identify three steps of selection: students select the schools to which they apply, schools admit students, and then students select a school from those to which they were admitted. Because I am not modeling the selection process, its exact form is less important here.

Surprisingly, running a bivariate regression of GPA on family income yields a larger negative coefficient than one gets in equation 3.4, suggesting that low income students at the University of Maryland are unconditionally stronger applicants: they are simply diamonds. Adding in the SAT, per equation 3.4, mitigates the negative coefficient on family income. This is not what the diamonds in the rough hypothesis would predict. The negative coefficient on family income, which would have been surprising had the unconditional regression coefficient been positive, is not surprising when low income students are unconditionally outperforming their high income peers.

In the main specification (eq. 3.4), the outcome variable (student output) is measured in two different ways: GPA and student ability. Student ability should not be confused with innate ability (α_0); it is measured as the student fixed effect in a decomposition of grades into student and class effects and is best interpreted as exactly that.
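To make the main specification concrete, the following is a minimal sketch of equation 3.4 fit by OLS with standard errors clustered at the zip code level, as in the tables below. It assumes a hypothetical DataFrame `students` with columns "gpa", "sat", "fam_income" (in $10,000s), and "zip"; all of these names are illustrative, not the dissertation's actual code.

```python
# Minimal sketch of the main specification (eq. 3.4), with standard errors
# clustered at the zip-code level. The `students` DataFrame and its column
# names ("gpa", "sat", "fam_income", "zip") are illustrative assumptions.
import statsmodels.formula.api as smf

model = smf.ols("gpa ~ sat + fam_income", data=students)
result = model.fit(cov_type="cluster", cov_kwds={"groups": students["zip"]})
print(result.summary())  # beta_2 on fam_income is the coefficient of interest
```

Replacing "gpa" with the estimated student fixed effect as the outcome gives the second column of the result tables.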
In chapter 2, I found that GPA and student ability are highly correlated, with a correlation coefficient of 0.90, suggesting that both measures should tell a similar story in this chapter. The main results of this chapter are approximately the same across the two specifications, consistent with my previous chapter's findings. However, I find that there is a difference between GPA and student ability. In one specification check, I find that a one point higher SAT math score is associated with higher student ability than a one point higher SAT verbal score. At the same time, a one point higher SAT math score is associated with the same GPA as a one point higher SAT verbal score. Because SAT math and SAT verbal are approximately Z-scores, this implies that for students with higher SAT math scores, GPA underestimates actual ability.

The next section describes the data used in this chapter, the sample selection criteria applied, and their impact on the covariates. The third section presents the results, and the final section concludes.

3.2 Data

The data for this chapter, like those described in section 2.3, are drawn from transcripts of University of Maryland college students between 2003 and 2010. The main sample is a subset of the "enter" sample, which focuses on matriculated students who entered between 2003 and 2005 and thus had five years in which to complete their degree, though it is not a requirement that the students in the sample did, in fact, complete a degree. Later specification checks use the "degree" sample, which adds a requirement that the student appears to have graduated, and the "full" sample, which removes the admission-year requirement of 2003 through 2005. In contrast with the previous chapter, an observation in this chapter is a student.

Student performance is measured by GPA and by student ability (a_i), as measured from the decomposition in chapter 2 of the form

G_ij = a_i − d_j + Xβ + ε_ij,   (3.5)

where G_ij is the grade student i received in class j, a_i is a fixed effect for the ith student, d_j is a fixed effect for the jth class, and X holds some other variables, such as year fixed effects. The exact specification used is specification D from chapter 2 (Table 2.7), which includes controls for semester, class year, and the number of classes the student is signed up for in the semester.

Two methods are used to fit this decomposition in chapter 2, ordered multinomial logit and OLS; in this chapter I use the OLS results because they are readily interpreted as changes to GPA, while the ordered logit results are in terms of a latent parameter. For example, when using the OLS results, a student with an ability 1.0 higher than another student would be expected to get one letter grade higher when he or she takes the same class. In contrast, when using the ordered logit results, a similar statement cannot be constructed.

Another statistic used to describe a student's college career is the average class difficulty (d̄_i), defined as

d̄_i = (1/n_i) Σ_{k ∈ classes of student i} d_k,   (3.6)

where the sum is over the classes student i enrolled in, of which there are n_i. Each student has his or her own average class difficulty: the average of the difficulty of the classes the student took. In eq. 3.5, the difficulty enters with a negative sign in front of it, so that more difficult classes are associated with lower grades.
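Since this chapter relies on the OLS fit of equation 3.5 and the per-student average difficulty of equation 3.6, here is a minimal sketch of both, assuming a hypothetical DataFrame `transcripts` with columns "student", "class", and "grade_points" (names illustrative) and omitting the control variables X for brevity.

```python
# Minimal sketch of the OLS decomposition G_ij = a_i - d_j + e_ij (eq. 3.5,
# controls omitted) and of average class difficulty d-bar_i (eq. 3.6).
# The `transcripts` DataFrame and its column names are illustrative.
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.sparse.linalg import lsqr

def decompose(transcripts: pd.DataFrame):
    s = transcripts["student"].astype("category")
    c = transcripts["class"].astype("category")
    si, ci = s.cat.codes.to_numpy(), c.cat.codes.to_numpy()
    n, ns, nc = len(transcripts), len(s.cat.categories), len(c.cat.categories)
    # Sparse design matrix: +1 on the student's column, -1 on the class's.
    rows = np.repeat(np.arange(n), 2)
    cols = np.column_stack([si, ns + ci]).ravel()
    vals = np.tile([1.0, -1.0], n)
    X = sparse.csr_matrix((vals, (rows, cols)), shape=(n, ns + nc))
    # lsqr returns a minimum-norm solution: a and d are only identified up
    # to a common additive constant, so levels are relative, not absolute.
    beta = lsqr(X, transcripts["grade_points"].to_numpy())[0]
    a_hat = pd.Series(beta[:ns], index=s.cat.categories)
    d_hat = pd.Series(beta[ns:], index=c.cat.categories)
    # Eq. 3.6: average the difficulty of the classes each student took.
    dbar = transcripts["class"].map(d_hat).groupby(transcripts["student"]).mean()
    return a_hat, d_hat, dbar
```

The minimum-norm normalization is one of several reasonable choices; any statistic built from these estimates should be invariant to a common shift, as the correlations used here are.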
The median income of the zip code listed as a student's permanent address is used as a proxy for the student's family income (this variable is called family income throughout this chapter). The measure can be thought of as the typical family income in the community the student is from. In addition, while school boundaries and zip code boundaries are not identical, both are based on geographic proximity, so membership in a community may not imply identical incomes, but it does imply access to a similar level of public primary and secondary schooling.8

8 See, for example, Oates (2005) for an excellent review of the concept of the market for local government goods. One example is Hamilton, who argues that zoning and property taxes can homogenize communities with respect to public service demand (Hamilton, 1975).

The main specification requires GPA, SAT score, student ability, and a zip code with a published median income in the 2000 Census Summary File 3 (U.S. Census Bureau, 2001). Only students for whom all of these variables are present are included in the sample. From a baseline of all students who qualify for the "enter" sample (n_0 = 19,500), removing students with zip codes that are not tabulated (presumably to maintain the privacy of those living at the address) removes about 1,000 individuals; removing students without a valid SAT score removes 7,300 students (n = 10,621). The other requirements do not remove any students because they are derived from the transcript data itself.

Students selected into the sample are similar to the entire "enter" sample. A comparison of the raw and selected groups is shown in Table 3.1 (columns 1 and 2) and shows that the two groups are approximately balanced on race and gender. The academic achievement variables are slightly higher for the selected sample (for example, GPA is 0.03 higher and student ability is 0.05 higher), reflecting a small increase in typical student quality. The exception to the increase in academic achievement is that the students in the sample have a slightly lower average high school GPA. Without controlling for the high school the student attended, it is not possible to know whether this represents increased difficulty at their particular high schools or a lower level of achievement. In any case, the difference is small.

Not all of the students in the sample took the SAT. Some of the students who did not take the SAT took the ACT instead. It is possible to use a conversion factor to estimate an SAT score from an ACT score. Wainer (1986) observes that while such conversion is possible, it is inaccurate for individuals, even if accurate when averaged over large groups. Wainer published his conversion factors, but since the publication date, Educational Testing Service (the owner of the SAT) decided to "recenter" the SAT periodically, essentially updating the scores to be normally distributed with a relatively invariant mean and standard deviation by applying periodic adjustments to the raw score on the SAT scale (Dorans, 2002).9 This recentering renders conversion factors based on pre-recentering data (such as Wainer's) inappropriate. Using a subsample of the students who took both the SAT and the ACT, and ignoring the possible selection bias associated with using this self-selected group, it is possible to develop a conversion factor based on a linear model (Table 3.2). These conversion factors are not ideal.

9 The scores are centered so the mean is about 500 and the standard deviation is about 110 for each test.
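A minimal sketch of the conversion regressions in Table 3.2, assuming a hypothetical DataFrame `both` of students who took both tests and a DataFrame `act_only` of ACT-only students; all column names are illustrative.

```python
# Minimal sketch of the ACT-to-SAT conversion in Table 3.2: linear models
# fit on students who took both tests, then used to impute missing SATs.
# The DataFrames `both` and `act_only` and their columns are illustrative.
import statsmodels.formula.api as smf

verbal_fit = smf.ols("sat_verbal ~ act_reading + act_english", data=both).fit()
math_fit = smf.ols("sat_math ~ act_math", data=both).fit()

# Impute SAT scores for the ACT-only students.
sat_verbal_hat = verbal_fit.predict(act_only)
sat_math_hat = math_fit.predict(act_only)
```

As the next paragraph notes, the imputation is noisy at the individual level, which is why these students are held out of the main sample and used only in a specification check.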
The R² is about 0.7 for both the math and the verbal imputation schemes, so only about 70% of the variation in SAT scores is explained by the ACT scores. The observable academic measures of the ACT-only students (including their SAT scores imputed from ACT scores) differ from those of the selected group (Table 3.1, column 3). The students who took only the ACT have lower (imputed) SAT scores, higher family income, and higher ability than the students who took the SAT. In light of this, an additional regression is run with these students included as a specification check.

Subsamples of the "degree" and "full" samples are also used as specification checks. These subsamples are created using the same sample selection as the "enter" group described above. The "degree" sample is created in a similar way to the "enter" sample, except that an attempt is made to winnow it to students who probably graduated by limiting the sample to students who were continuously enrolled for the entire sample period or who completed over 120 credits.10 Even more than for the "enter" sample, the "degree" sample and the selected "degree" sample are very similar to each other (Table 3.3).

10 A full description appears in chapter 2.

The "full" sample is similar to the "enter" sample but removes the requirement that the student started taking classes by 2005. For the "full" sample there is a non-trivial increase in student ability (a_i) in the selected sample, an increase of 0.10, which is much larger than in the other samples (Table 3.4). While larger than the other changes, it is still not very large, and it has no obvious effect on the regression results shown.

3.3 Results

The main regression is

Y = α + β_1 [SAT] + β_2 [family income] + ε,   (3.7)

where Y is the outcome of interest (GPA or student ability). The regression asks whether, controlling for SAT scores, family income is associated with higher or lower GPA or student ability. The family income coefficient from this regression shows that higher incomes are associated with slightly lower GPAs and student ability (Table 3.5). The estimated change in GPA from a $10,000 higher family income is −0.013. The estimated change in student ability from a $10,000 higher family income is −0.012.

As an example, consider two students, one with average family income, the other from a family with a two standard deviation higher income ($27,000 higher), where both students have combined SAT scores of 1210. The lower income student would be expected to earn a 0.035 higher GPA, or to have an estimated 0.032 higher ability. A third student from the same lower income family but with a 24 point lower SAT score (an 1186) would be expected to match the GPA of the higher income student.11

11 Differences in SAT scores can only be denominated in units of 10, so a technically correct statement would be that a group of students from the same lower income family with average SAT scores of 1186 would be expected to perform the same as this hypothetical third student.

Running this regression with just the family income regressor (dropping SAT) reveals a simpler explanation for the negative coefficient β_2: lower income students are also unconditionally higher performing (Table 3.6). The unconditional regression coefficient on family income is −0.021 when GPA is the outcome and −0.022 when the student fixed effect is the outcome. The R-squared of this regression is almost exactly zero, suggesting that family income is not a strong determinant of grades. When similarly running the regression with just SAT, the regression coefficients are almost unchanged relative to the results when including family income.
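The back-of-envelope numbers in the two-student example follow directly from the Table 3.5 point estimates; a quick sketch of the arithmetic (only the two coefficients come from the table, the rest is illustrative):

```python
# Back-of-envelope arithmetic behind the two-student example, using the
# Table 3.5 point estimates: -0.013 GPA per $10,000 of family income and
# 0.147 GPA per 100 SAT points.
beta_income, beta_sat = -0.013, 0.147

gpa_gap = -beta_income * 2.7                  # $27,000 = 2.7 x $10,000 -> ~0.035 GPA
offsetting_sat = gpa_gap / (beta_sat / 100)   # ~24 SAT points
print(round(gpa_gap, 3), round(offsetting_sat, 1))  # prints: 0.035 23.9
```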
An apparent reason for the negative coefficient β_2 even when SAT is dropped might be that, while nationally SAT and family income are positively correlated (College Board, 2009), at the University of Maryland family income and SAT scores are almost uncorrelated. In fact, there is a small negative correlation coefficient (−0.04), meaning that lower income students have slightly higher SAT scores. This raises the possibility that they also have higher unobservable quality, and that this is what drives the negative coefficient on family income when conditioning on SAT.

In specification 3.7, I lumped together SAT math and SAT verbal scores as if they test two skills that are equally useful for increasing student output. However, they might be different, and this is tested by separating out the SAT scores in the regression

Y = α + β_1v [SAT verbal] + β_1m [SAT math] + β_2 [family income] + ε.   (3.8)

Adding SAT math and verbal scores separately does not change the results in a statistically significant way for the GPA estimates, meaning that the simpler model with the two tests lumped together cannot be rejected as less explanatory. The best estimates from regression equation 3.8 suggest that an increase of 100 points in SAT verbal score increases GPA by 0.155, while a 100 point increase in SAT math score increases GPA by 0.139 (Table 3.7). The insignificant t-test on the contrast between the two regressors indicates that SAT math and SAT verbal do not have different effects on GPA.

In the same specification, using student ability as the outcome variable, the addition of SAT math is statistically significant and has a much stronger association with student ability than SAT verbal scores. An increase of 100 points in SAT verbal score is associated with an increase in ability of 0.421, while an increase of 100 points in SAT math score is associated with an increase of 0.580 in student ability (Table 3.7).

This result is striking. Despite the previous chapter's finding that GPAs are a good measure of student ability, GPA is not affected differently by SAT math and SAT verbal scores, but student ability is. For both possible outcome variables, the main result (the change in output associated with family income) is not affected by breaking SAT down into math and verbal scores.

The most obvious explanation for why these two measures differ is that the difficulty of the classes students take varies with family income and SAT math score. To verify this, I regressed students' average class difficulty (d̄_i) on the same regressors using the same specifications (Table 3.8). These results show that average class difficulty matters. Students with relatively higher SAT math scores take harder classes than their counterparts with lower SAT math scores. An increase of 100 points in SAT math increases average class difficulty by 0.093.12 Higher SAT verbal scores are associated with a slight decrease in average class difficulty, of 0.023. Both the coefficient on SAT math and the coefficient on SAT verbal are statistically significant at the 1% level, as is the contrast between the two. An increase in family income increases difficulty, if very slightly, by 0.003 regardless of the specification, and this result becomes statistically significant at the 10% level when including SAT math and SAT verbal separately.

12 Note that one might have hoped that the increase in ability minus the increase in difficulty would exactly equal the decreased GPA, but it does not. However, a null hypothesis that they are equal cannot be rejected.
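A minimal sketch of the separated specification (eq. 3.8) and the math-versus-verbal contrast test reported in Tables 3.7 and 3.8, using the same hypothetical `students` DataFrame and illustrative column names as before.

```python
# Minimal sketch of eq. 3.8 with the SAT math vs. verbal contrast test.
# The `students` DataFrame and its columns are illustrative assumptions.
import statsmodels.formula.api as smf

fit = smf.ols("gpa ~ sat_verbal + sat_math + fam_income", data=students).fit(
    cov_type="cluster", cov_kwds={"groups": students["zip"]}
)
# t-test of H0: beta_math - beta_verbal = 0 (the "contrast" row in Table 3.7).
print(fit.t_test("sat_math - sat_verbal = 0"))
```

Re-running the same two lines with the student fixed effect or average class difficulty as the outcome reproduces the other contrast tests described above.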
This suggests that it is possible that higher income students take more difficult classes, on average, even if only slightly.

To test the robustness of the main results, I performed additional checks. First, family income might not enter entirely linearly, for example if the effect accrues largely at the top or bottom of the income distribution. I ran a regression with family income binned by quintile, with the bins included as dummy variables in lieu of the linear term (Table 3.9). When this is done, the effects are approximately linear for GPA, and noisy but still approximately linear for ability. In the case of GPA, the estimated change in GPA by income quintile decreases monotonically, with an average decrease of 0.014 from bin to bin. When using student ability as the outcome, the estimated effects decrease monotonically, with only one exception, at the lowest income group. Overall, the non-linearities appear to be relatively subtle, making the assumption of linearity reasonable.

In the previous chapter, I used ordered logit as well as OLS to estimate the decomposition in equation 3.5, and there are theoretical reasons to prefer the ordered multinomial logit to the OLS results even though they were very similar in final fit. Running the regression on the logit-based ability measure shows slightly different results. The estimated regression coefficient is still negative, at −0.019, but it is not statistically significant, with an associated t-statistic of 1.6. Note that the different absolute values of the estimated coefficients for the OLS-based and ordered multinomial logit-based student ability are not themselves interpretable, because the two ability measures are not on the same scale.

Students who did not take the SAT but did take the ACT are not in the main sample, despite the fact that it is possible to impute their SAT scores from the ACT. Imputed SAT scores are more imprecise, and that can bias a regression coefficient towards zero.13 However, there is a competing concern that these students are not matched exactly on their baseline characteristics. Running the regression including these students does not move the estimate closer to zero; instead it decreases the estimate slightly, to −0.013, and it becomes significant at the 1% level (Table 3.10). This suggests that removing these students did not bias the sample towards negative and more significant results.

13 When a noisy measure of a regressor is used in place of the true value, the regression coefficient is biased towards zero. What is worse, as other regressors are added, the bias increases (Griliches, 1977).

The "enter" sample includes most students who started college at the University of Maryland in 2003 through 2005, regardless of whether they ultimately graduated. Running the regressions using just those who appear to have graduated (the "degree" sample) gives very similar results, except that when GPA is the outcome the estimate for family income increases slightly in absolute terms, to −0.015; it remains significant (at the 1% level) when predicting the student fixed effect (Table 3.12). Using the "full" sample, which includes students who entered after 2005, also shows a nearly identical result to the "enter" sample, with an estimated coefficient on family income of −0.012, a result that is again statistically significant at the 1% level (Table 3.13). The loss of significance using the ordered logit-based estimate of ability may have been a result of the marginal significance of the original regression coefficient.
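A minimal sketch of the linearity check in Table 3.9, again on the hypothetical `students` DataFrame; pd.qcut and the omitted middle bin stand in for the dissertation's actual binning procedure.

```python
# Minimal sketch of the linearity check (Table 3.9): replace the linear
# family-income term with quintile dummies, omitting the middle quintile.
# The `students` DataFrame and its columns are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

students["inc_q"] = pd.qcut(
    students["fam_income"], 5,
    labels=["lowest", "low", "middle", "high", "highest"],
)
fit = smf.ols(
    "gpa ~ sat + C(inc_q, Treatment(reference='middle'))", data=students
).fit(cov_type="cluster", cov_kwds={"groups": students["zip"]})
print(fit.params.filter(like="inc_q"))  # roughly linear step-downs expected
```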
Rerunning the regressions on the "degree" and "full" samples using the logit-based ability measure shows that both have a statistically significant estimate of family income (Tables 3.14 and 3.15, respectively).

3.4 Discussion and Conclusion

These results show that, at the University of Maryland, conditional on SAT scores, lower income is associated with higher college output. However, the same is also true unconditional on SAT scores. In any case, the result is relatively modest: a $10,000 increase in household income (approximately one standard deviation) is associated with a decrease of about 0.01 in GPA, suggesting that the family income effect identified here, even if causal, would not dominate students' college performances.

Methodologically, this result holds whether college output is measured directly via GPA or via ability, which controls for the difficulty of the classes in which students enroll. This suggests that GPA and ability could have been used interchangeably to reach the same conclusions in this chapter. However, one specification revealed a weakness of measuring collegiate output with GPA. An increase in SAT math is associated with an increase in class difficulty and an approximately offsetting increase in student ability. In contrast, an increase in SAT verbal is associated with a smaller increase in student ability and a slightly negative change in average class difficulty. For this chapter, this asymmetry was irrelevant to the conclusion, but one could imagine a case where it is important. Because of this, using GPA as a proxy for ability may still make sense, but additional consideration should be given to whether relative math ability might play a role in the particular question being asked.

At the beginning of this chapter I posited two students who received the same SAT scores, one from a lower income family and the other from a higher income family. I then asked whether innate ability, previously unrealized, would shine through at the University of Maryland. However, data from the University of Maryland proved inappropriate for answering this question because of the non-random nature of enrollment.

Table 3.1: Sample description

                          Enter            Enter sample     Enter ACT
GPA                       2.80 (0.747)     2.83 (0.731)     2.90 (0.716)
SAT verbal                5.91 (0.864)     5.91 (0.873)     5.71 (0.617)
SAT math                  6.24 (0.878)     6.18 (0.892)     6.05 (0.837)
family income /$10,000    4.13 (1.35)      4.13 (1.30)      4.19 (1.43)
a_i                       -1.17 (0.748)    -1.12 (0.733)    -1.11 (0.698)
d̄_i                       -0.198 (0.221)   -0.199 (0.218)   -0.170 (0.215)
took SAT                  0.599            1.000            0.000
HS GPA                    3.75 (0.508)     3.72 (0.512)     3.79 (0.481)
has HS GPA                0.790            0.950            1.000
transfer                  0.407            0.319            0.189
female                    0.494            0.488            0.652
white                     0.565            0.574            0.545
n terms                   8.25 (3.14)      8.60 (3.15)      8.77 (2.86)
n classes                 34.7 (12.7)      36.4 (12.4)      39.1 (11.2)
total credits             101 (35.6)       107 (34.6)       113 (30.9)
degree seeking            1.000            1.000            1.000
n                         19,489           11,183           514

Note: The first column is the entire "Enter" sample, before the sample selection criteria for this chapter are applied. The second column is the main sample for this chapter. The third column is the students who have an ACT score but no SAT score; their imputed SAT scores are reported. Each entry is the average of the variable named in the row, with standard deviations in parentheses for non-binomial variables. Some of the sample selection criteria force a variable to be exactly 0 or 1.
Table 3.2: Predicting SAT with ACT scores

                SAT verbal   SAT math
                (1)          (2)
Intercept       234          203
                (5)          (3)
ACT reading     6.82         -
                (0.25)
ACT English     7.16         -
                (0.29)
ACT math        -            15.7
                             (0.1)
R²              0.67         0.73
obs             2,657        6,402

Note: Standard errors appear in parentheses below each regression coefficient. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. The first column shows a regression of SAT verbal scores on ACT tests; the second column shows a regression of SAT math scores on the ACT math test.

Table 3.3: Sample description: degree dataset

                          Degree           Degree sample
GPA                       2.94 (0.599)     2.96 (0.583)
SAT verbal                5.97 (0.831)     5.98 (0.838)
SAT math                  6.32 (0.858)     6.26 (0.871)
family income /$10,000    4.12 (1.32)      4.12 (1.27)
a_i                       -1.03 (0.635)    -0.988 (0.620)
d̄_i                       -0.221 (0.213)   -0.219 (0.212)
took SAT                  0.643            1.000
HS GPA                    3.80 (0.493)     3.78 (0.495)
has HS GPA                0.881            0.972
transfer                  0.289            0.239
female                    0.485            0.482
white                     0.569            0.571
n terms                   9.94 (2.11)      10.1 (2.19)
n classes                 41.6 (7.91)      42.3 (7.48)
total credits             121 (20.1)       123 (19.0)
degree seeking            1.000            1.000
n                         12,995           8,017

Note: Means, with standard deviations in parentheses for non-binary variables. This table shows the degree sample before and after the sample selection criteria are applied.

Table 3.4: Sample description: full dataset

                          Full             Full sample
GPA                       2.82 (0.717)     2.85 (0.693)
SAT verbal                5.89 (0.833)     5.91 (0.864)
SAT math                  6.22 (0.873)     6.18 (0.882)
family income /$10,000    4.15 (1.35)      4.12 (1.30)
a_i                       1.55 (0.727)     1.67 (0.709)
d̄_i                       -0.111 (0.241)   -0.0792 (0.238)
took SAT                  0.399            1.000
HS GPA                    3.75 (0.491)     3.69 (0.502)
has HS GPA                0.785            0.945
transfer                  0.396            0.284
female                    0.483            0.479
white                     0.575            0.588
n terms                   6.23 (3.19)      7.07 (3.29)
n classes                 26.2 (12.9)      28.7 (13.2)
total credits             76.4 (37.0)      84.4 (37.8)
degree seeking            0.992            0.997
n                         63,875           24,435

Note: Means, with standard deviations in parentheses for non-binary variables. This table shows the full sample before and after the sample selection criteria are applied.

Table 3.5: Predicting outcomes with SAT and income

                          I
                          GPA        a_i
SAT/100                   0.147      0.187
                          (0.005)    (0.005)
family income /$10,000    -0.013     -0.012
                          (0.005)    (0.005)
R²                        0.10       0.16
obs                       11,183     11,183

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. These are predictions of GPA and of the student fixed effect (described in the body) using the sum of verbal and math SAT and median income by zip code, fit on the "enter" dataset (see text).

Table 3.6: Predicting outcomes with SAT or income

                          GPA        a_i        GPA        a_i
SAT/100                   0.148      0.187      -          -
                          (0.005)    (0.005)
family income /$10,000    -          -          -0.021     -0.022
                                                (0.008)    (0.008)
R²                        0.10       0.19       0.00       0.00
obs                       11,183     11,183     11,183     11,183

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. These are predictions of GPA and of the student fixed effect (described in the body) using the sum of verbal and math SAT, or median income by zip code, fit on the "enter" dataset (see text).

Table 3.7: Predicting outcomes with SAT math and verbal separately

                              I                     II
                              GPA        a_i        GPA        a_i
SAT/100                       0.147      0.187      -          -
                              (0.005)    (0.005)
SAT verbal/100                -          -          0.155      0.143
                                                    (0.009)    (0.009)
SAT math/100                  -          -          0.139      0.229
                                                    (0.009)    (0.009)
family income /$10,000        -0.013     -0.012     -0.013     -0.011
                              (0.005)    (0.005)    (0.005)    (0.005)
contrast:
SAT math/100 − SAT verbal/100 -          -          -0.016     0.086
                                                    (0.018)    (0.018)
R²                            0.10       0.16       0.10       0.16
obs                           11,183     11,183     11,183     11,183

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. Specification I is reproduced here for easy comparison with specification II.

Table 3.8: Predicting average class difficulty

                              I          II
                              d̄_i        d̄_i
SAT/100                       0.036      -
                              (0.001)
SAT verbal/100                -          -0.023
                                         (0.003)
SAT math/100                  -          0.093
                                         (0.003)
family income /$10,000        0.003      0.003
                              (0.002)    (0.002)
contrast:
SAT math/100 − SAT verbal/100 -          0.116
                                         (0.005)
R²                            0.07       0.11
obs                           11,183     11,183

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. This table re-estimates the preceding specifications with average class difficulty for each student as the outcome.

Table 3.9: Predicting outcomes with SAT: linearity of income

                          I                     IV
                          GPA        a_i        GPA        a_i
SAT/100                   0.147      0.187      0.147      0.187
                          (0.005)    (0.005)    (0.005)    (0.005)
family income /$10,000    -0.013     -0.012     -          -
                          (0.005)    (0.005)
family income lowest      -          -          0.020      0.006
                                                (0.027)    (0.025)
family income low         -          -          0.012      0.011
                                                (0.029)    (0.027)
family income middle      -          -          -          -
family income high        -          -          -0.013     -0.008
                                                (0.035)    (0.030)
family income highest     -          -          -0.038     -0.044
                                                (0.026)    (0.024)
R²                        0.10       0.16       0.10       0.16
obs                       11,183     11,183     11,183     11,183

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. This tests the linearity of zip code median income by breaking it into five bins (the middle bin is the omitted group). Specification I is reproduced here for easy comparison with specification IV.

Table 3.10: Predicting outcomes with SAT, including SATs imputed from ACT scores

                          I
                          GPA        a_i
SAT/100                   0.148      0.188
                          (0.005)    (0.005)
family income /$10,000    -0.014     -0.013
                          (0.005)    (0.005)
R²                        0.10       0.16
obs                       11,753     11,753

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. This shows results when including those students who took the ACT and not the SAT, with their SAT imputed from their ACT score.

Table 3.11: Predicting outcomes with SAT and income, ordered multinomial logit based ability estimate

                          I
                          GPA        a_i
SAT/100                   0.147      0.502
                          (0.005)    (0.011)
family income /$10,000    -0.013     -0.019
                          (0.005)    (0.012)
R²                        0.10       0.19
obs                       11,183     11,183

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. Predictions of GPA and of the student fixed effect (described in the body) using the sum of verbal and math SAT and median income by zip code, fit on the "enter" dataset (see text).

Table 3.12: Predicting outcomes with SAT and income, degree dataset

                          I
                          GPA        a_i
SAT/100                   0.140      0.189
                          (0.004)    (0.004)
family income /$10,000    -0.015     -0.013
                          (0.005)    (0.005)
R²                        0.13       0.21
obs                       8,017      8,017

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. This is the same model as was fit in Table 3.5, but it is fit on the smaller "degree" dataset (see text).

Table 3.13: Predicting outcomes with SAT and income, full dataset

                          I
                          GPA        a_i
SAT/100                   0.123      0.171
                          (0.003)    (0.003)
family income /$10,000    -0.012     -0.012
                          (0.004)    (0.004)
R²                        0.08       0.14
obs                       24,435     24,435

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. This is the same model as was fit in Table 3.5, but it is fit on the larger "full" dataset (see text).
Table 3.14: Predicting outcomes with SAT and income, degree dataset, ordered multinomial logit ability

                          I
                          GPA        a_i
SAT/100                   0.140      0.525
                          (0.004)    (0.011)
family income /$10,000    -0.015     -0.025
                          (0.005)    (0.013)
R²                        0.13       0.23
obs                       8,017      8,017

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. This is the same model as was fit in Table 3.5, but it is fit on the smaller "degree" dataset (see text) with the fixed effects taken from the ordered multinomial fit.

Table 3.15: Predicting outcomes with SAT and income, full dataset, ordered multinomial logit ability

                          I
                          GPA        a_i
SAT/100                   0.123      0.469
                          (0.003)    (0.007)
family income /$10,000    -0.012     -0.018
                          (0.004)    (0.009)
R²                        0.08       0.17
obs                       24,435     24,435

Note: The standard errors in parentheses reflect clustering at the zip code level. Stars indicate significance at the 1% (***), 5% (**), or 10% (*) levels. This is the same model as was fit in Table 3.5, but it is fit on the larger "full" dataset (see text) with the fixed effects taken from the ordered multinomial fit.

Chapter 4: Estimation of Large Ordered Multinomial Models

When estimating an equation with grades on the left hand side, there are theoretical reasons to think an ordered multinomial estimator (OME) is preferable to the simpler ordinary least squares (OLS) estimator. An OME acknowledges that there is a maximum grade, so when a good student takes an easy class the prediction will not be higher than an 'A' (an 'F' is treated in the same way for low predictions). At the same time, an OME acknowledges that the spacing between grades need not be identical, so that the range of performance encompassed by 'B' grades need not be the same as that encompassed by 'C' grades. This is in contrast with OLS using the awarded grade points as the outcome variable, which can, for example, predict that a student's grade will be worth 5 grade points on a 4 point scale, and which cannot accommodate different widths of grades.

I assume grades are generated as follows. Each student has a true ability (a_i). This ability level is observed in an individual class as a noisy signal

ã_ij = a_i + ε_ij,   (4.1)

where ε_ij accounts for noise introduced into the observation in the grading process. The observed ability is not a grade but can be thought of as either a number in a ledger or a not-yet-quantified assessment of a final paper. It is then mapped to a grade using "cutoffs" that separate observed ability levels (ã_ij) into grades. For example, the 'A'/'B' cutoff is the location where students with observed ability above the cutoff receive 'A's and students just below the cutoff receive 'B's or lower grades.

Imagine two students, named 1 and 2, with ability levels a_1 and a_2, respectively (Figure 4.1, top scale). In a class c, their observed ability levels are given by ã_1c and ã_2c (Figure 4.1, middle scale). Based on their observed ability levels, these students are then assigned grades 'F' and 'A', respectively (Figure 4.1, bottom scale).

Theoretically, each class could have its own cutoffs for each grade: one class might allow only a small range of observed ability to fall into each grade, while another might allow a large range to fall into an individual grade. However, the concept of difficulty is cleanest if all classes share the sizes of the ranges for each grade, with all of the ranges moving up and down together (Figure 4.2); for tractability, I assume that this is true. Intuitively, this model allows for a university-wide agreement on the size of the 'B' range, as well as changes in the exact position of each of the grades.

Given their locations, one would expect student 1 to receive mainly 'D' grades and student 2 to receive mainly 'B' grades. The likelihood function for student 1 is given by the location of a_1, the distribution of ε, and the locations of the grade boundaries. To illustrate this, Figure 4.3 shows the distribution around a_1, shading an area proportional to the probability of student 1 receiving a 'D' grade.

The OME estimates the model by maximizing the likelihood function, where the probability of each grade is given by:

L(G | a, d)_ij = Pr(grade G_ij observed, conditional on a_i, d_j)   (4.2)
             = ∫_{G_ij} Pr(ã_ij − d_j = z) f(z) dz   (4.3)
             = ∫_{G_ij} Pr(a_i + ε_ij − d_j = z) f(z) dz   (4.4)
             = ∫_{G_ij} Pr(ε_ij = z − a_i + d_j) f(z) dz,   (4.5)

where the bounds of the integral are the cutoffs for the actual grade assigned to student i in class j, and f(z) is the density function of the distribution of ε (Figure 4.3).1 This leads to a regression of the form

G_ij = a_i − d_j + ε_ij,   (4.6)

where G_ij is the grade, a_i is the ability of the student, d_j is the difficulty of the class, i indexes the student, and j indexes the class. When the grading is more "difficult," the observed ability level required to get each grade is higher. This corresponds to a rightward movement of the bottom scale in Figure 4.2. In the figure, class B is more difficult than class A because, for example, some observed performance levels that are assigned a 'D' in class A are assigned an 'F' in class B.

1 These models are the ordered multinomial probit when ε is normally distributed and the ordered multinomial logit when ε is Weibull distributed. To create a general term, I call these estimators ordered multinomial estimators (OME), and I treat the distribution of ε as a fitted parameter, as in McCullagh & Nelder (1989).

The terms "ability" and "difficulty" are used somewhat loosely here, because so many factors play into them. For example, a class with good pedagogy might improve every student's output and would thus appear to be a less difficult class, ceteris paribus. Similarly, a student who spends more time on school work could potentially improve his or her "ability."2

2 This appears to be possible given the low time investment made in college (Babcock & Marks, 2010).
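A minimal sketch of the per-observation probability in eqs. 4.2-4.5 for the ordered-logit case, where the integral of the logistic density between adjacent cut points reduces to a difference of logistic CDFs. The cut points are borrowed from Table 2.5's model D purely for illustration, and the function names are illustrative.

```python
# Minimal sketch of the OME grade probability (eqs. 4.2-4.5), ordered-logit
# case: Pr(grade g) is the logistic mass between the grade's two cut points,
# centered at a_i - d_j. Cut points illustrate Table 2.5's model D values.
import numpy as np
from scipy.stats import logistic

CUTS = np.array([-np.inf, 0.00, 0.59, 0.95, 1.94, 2.27, 2.81, 4.16, 4.76, 5.73, np.inf])

def grade_prob(g, a_i, d_j):
    # g indexes grades 0='F', 1='D', ..., 9='A', lying between CUTS[g] and CUTS[g+1].
    center = a_i - d_j
    return logistic.cdf(CUTS[g + 1] - center) - logistic.cdf(CUTS[g] - center)

def log_likelihood(grades, students, classes, a, d):
    # Eq. 4.2 aggregated over transcript entries (i = student, j = class).
    return sum(np.log(grade_prob(g, a[i], d[j]))
               for g, i, j in zip(grades, students, classes))
```

Maximizing this log-likelihood over all a_i, d_j, and the cut points is the estimation problem the rest of the chapter is concerned with.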
Despite the advantage that this model describes the data generating process in a satisfying way, the OME can have biased estimates in the fixed effects context if the number of observations per unit (student) is not "large" (McCullagh & Nelder, 1989). However, the number of observations required to meet the threshold of being "large" is not well known.3 Because of this, the OME is not obviously the best estimator.

3 See Green (2002) for a review.

Both the OLS and OME estimators have theoretical problems (OLS incorrectly models the grading process as continuous, which it is not, while the OME does not have an applicable consistency proof), so how exactly to proceed when estimating a grade decomposition is not obvious. The remainder of this chapter explores questions surrounding the bias of the estimators in simulations (section 2), the role of bias in significance tests for the OME (section 3), and the difficulty of estimating the OME (section 4). A final section concludes.

4.1 Simulation of estimators

It is well known that a regression of the form of eq. 4.6 need not be consistent when the number of fixed effects increases with the sample size. A literature has developed around estimating this type of problem (Ferrer-i-Carbonell & Frijters, 2004; Chamberlain, 1980). One related estimator is described by Chamberlain (1980). His method is intended for estimating parameters of interest in the presence of fixed effects in binomial and multinomial estimators, treating the fixed effects as nuisance parameters.4 The method substitutes, for each fixed effect, a term that conditions on the margins/totals of the outcomes by unit.5 A problem with this approach is that the conditional term involves calculating 10^6 to 10^32 terms per fixed effect, an infeasible task.6 Substantial work has been done by Ferrer-i-Carbonell & Frijters (2004) to minimize the computational effort while allowing very relaxed assumptions about the data generating process. Even the least intensive of the non-linear methods described by Ferrer-i-Carbonell and Frijters requires far too much computation for problems with as many observations per unit as a typical student in these data (circa 40 observations).

4 An issue that hampers, but does not exclude, the method from consideration, as described below.
5 Chamberlain shows that the resulting estimator is consistent for the remaining parameters. Here, however, the fixed effects are the parameters of interest. Nonetheless, the method described by Chamberlain could be used to estimate the student fixed effects netting out the class fixed effects, and then to estimate the class fixed effects netting out the student fixed effects.
6 A second method discussed by Chamberlain (1980) is a probit with random effects. While that model does not assume that the fixed effects are uncorrelated, it treats the main results of this thesis (the relationship between the fixed effects) as a nuisance parameter in a way that is integral to its nature.

Another complicating factor is that, unlike in the existing literature, the results herein are not regression parameters themselves but relationships between them. The main results of interest include the correlations between all possible pairs of GPA, ability, and difficulty, and the ability expansion path. Since the correlations are location and scale invariant, a bias in the estimator that doubled one set of fixed effects would not change the correlations. Because of this, the exact properties of the estimator with respect to the statistics I use are not well known. When faced with an estimator that is known not to have all the desirable properties, one way to quantify its performance is simulation (Heckman, 1981; Green, 2002).

To investigate the properties of the OME and OLS, three cases are simulated by generating student and class fixed effects randomly using the OME model described above. Estimates of the fixed effects â and d̂ from the OME and OLS estimators, and the statistics later calculated in the results, are compared to their true values (in the simulation). The exact simulations are chosen to show how well these estimators identify bias in grades in simulations where there is and is not bias. The three cases test the estimators in different situations (Table 4.1). In the first simulation, student ability and class difficulty are uncorrelated ("uncor"); in the second simulation, student ability and class difficulty are strongly correlated ("cor"); in the third simulation, the first simulation's parameters are modified so that a large number of students receive 4.0 GPAs ("high GPA"). The parameters used to generate these simulations are also shown in Table 4.1.
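A minimal sketch of this kind of simulation, assuming logistic errors and evenly spaced cut points; the parameter values and the way correlation is induced are illustrative stand-ins for the Table 4.1 settings, not the dissertation's actual simulation code.

```python
# Minimal sketch of the grade-generating simulation: draw student and class
# fixed effects (optionally correlated), add logistic noise, and discretize
# observed ability into grades at fixed cut points. All parameter values
# are illustrative, not the Table 4.1 values.
import numpy as np

rng = np.random.default_rng(0)
n_students, classes_per_student = 1000, 32
rho = 0.0  # 0.0 for the "uncor" case; set, e.g., 0.5 for the "cor" case

a = rng.normal(0.0, 1.6, n_students)   # student ability fixed effects
cuts = np.arange(0.0, 9.0)             # evenly spaced grade boundaries
grades = []
for i in range(n_students):
    # Difficulty of the classes student i takes, correlated rho with ability.
    d = rho * a[i] + np.sqrt(1 - rho**2) * rng.normal(0.0, 1.6, classes_per_student)
    eps = rng.logistic(0.0, 1.0, classes_per_student)  # ordered-logit errors
    latent = a[i] - d + eps                            # observed ability net of difficulty
    grades.append(np.searchsorted(cuts, latent))       # 0 = 'F', ..., 9 = 'A'
```

Both estimators are then fit to the simulated transcripts, and the derived statistics are compared with the values implied by `a`, `d`, and `rho`.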
An individual class's grade can be described by its grade points (GP). The average of GP is determined by the difference between E(a) and E(d)7 and, because of upper and lower limit effects, to a lesser extent by the variance-covariance matrix of a and d. The cutoffs between grades are kept linear (the spaces between the b values are always exactly one), so the effect of unequal spacing of grades is not investigated.8

7 Here expectations are taken over the population of transcript entries.
8 Some investigation of the importance of non-linearities showed little difference between OLS and the OME when they were present.

In the results, I summarize the data with Pearson correlation coefficients between student ability (a_i) and average class difficulty (d̄_i); the performance expansion path (a plot of GPA versus average class difficulty); and Goodman-Kruskal correlation coefficients (γG) between class difficulty (d_j) and student ability (a_i). The relationship between the estimated derived statistics (i.e., the Pearson correlation coefficients, the performance expansion path, and the Goodman-Kruskal correlation coefficients) and the true (simulated) values is used to judge the estimators.

Looking at the simulated results, both the Pearson correlation coefficients (ρ) and the Goodman-Kruskal coefficients (γG) are reliably estimated in almost every instance (Table 4.2). When the correlation between ability and GPA is high, the estimated correlation is high, and the converse is also true. However, the standard deviations of ability and class difficulty are not as reliably estimated. For the "uncor" and "cor" simulations, the OME estimates of the standard deviation of student ability (a_i) are slightly high, while none of the OLS estimates are accurate, all of them being too small. Removing the approximately 2,000 students (about 6%) with 4.0s (Table 4.3) does not negatively impact the estimates in the "uncor" and "cor" cases, and it improves the OME estimates of the parametric standard deviation and the Pearson correlation.

Another summary of the data used in the results is the performance expansion path. The estimated and true performance expansion paths are plotted for the "uncor" simulation (Figure 4.4), the "cor" simulation (Figure 4.5), and the "high GPA" simulation (Figure 4.6). These figures are generated by separating the students into vigentiles (equally sized twentieths) by true/estimated student ability (a_i / â_i); the average GPA and average class difficulty (d̄_i) are then calculated within each bin. The 4.0s are always segregated into a single bin, so that there are 21 total bins. These figures show that when grades are unbiased ("uncor"), the estimated ability expansion path is vertical, and that when the grades are biased by higher ability students taking harder classes ("cor"), the estimated ability expansion path slants to the right. In the results, these shapes are used to indicate the corresponding outcomes.

In the "uncor" and "cor" simulations, both estimators find the true vertical line when ability and difficulty are uncorrelated (Figure 4.4) and the slanted line when they are correlated (Figure 4.5). The only exceptions are the high estimate from the OLS estimator near 4.0 in the "uncor" simulation and a mild attenuation of the slope in the correlated case. In the "high GPA" case, similar to "uncor", both estimators again perform well, correctly identifying the vertical performance expansion path (Figure 4.6). However, both estimators overestimate the size of a hook to the left near 4.0.
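A minimal sketch of how the expansion-path points are constructed, assuming a hypothetical per-student DataFrame `per_student` with estimated ability "a_hat", "gpa", and average difficulty "dbar" columns; the names and the binning helper are illustrative.

```python
# Minimal sketch of the performance expansion path construction: bin
# students into vigentiles of estimated ability (4.0 GPAs kept in their
# own 21st bin), then average GPA and class difficulty within bins.
# The `per_student` DataFrame and its columns are illustrative.
import pandas as pd

is_top = per_student["gpa"] == 4.0
bins = pd.qcut(per_student.loc[~is_top, "a_hat"], 20, labels=False)
per_student["bin"] = bins.reindex(per_student.index, fill_value=20)  # bin 20 = 4.0s
path = per_student.groupby("bin")[["dbar", "gpa"]].mean()  # x = dbar, y = gpa
```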
In that top (4.0) bin, interestingly, the OME accurately estimates the average difficulty of classes taken by 4.0 students. These simulations show that both estimators are reliable, though there are situations where the OME is accurate while the OLS estimator is not. Removing 4.0s from the analysis is required to get accurate Pearson correlation coefficients and standard deviation estimates, though a sufficiently large number of 4.0s will also bias the estimated performance expansion path near the top of the GPA scale.

4.2 Significance testing

Typical significance tests for multiple regressors rely on an F-test. This is not possible for the OME. Instead, a likelihood-ratio (LR) test can be used:

  LR = 2(ℓ_l − ℓ_s)    (4.7)

where ℓ is the log-likelihood of a fitted model, and the subscripts l and s denote a larger model and a nested smaller model, respectively. Under the null hypothesis that the k additional variables in the larger model are all equal to zero (it is also possible to test the hypothesis that they equal non-zero values, but I use the simple and applicable case of zero here), the LR test statistic is chi-square distributed with k degrees of freedom, under the mild assumption that there are many observations for each of the k variables being tested (McCullagh & Nelder, 1989). Unfortunately, the term "many" is not well defined. When there are not sufficient observations, the likelihood ratio test statistic will not produce accurate p-values. This is because panel data analysis generates biased estimates of β when there are individual fixed effects and the outcome is discretized (Hahn & Newey, 2004). Consider estimates of the type

  Y*_it = X_it β + FE_i + ε_it,    (4.8)

where the subscript i indicates individuals in the panel, the subscript t indicates the time variable, FE_i is a fixed effect for each unit, and the observed outcome Y_it is a discretized version of Y*_it. The bias is often described by a Taylor series of the form

  β̂ = β + B/T + o(1/T²).    (4.9)

For related non-linear models, methods of removing the first-order bias term B/T exist, for example Hahn & Newey (2004). The correlation coefficients between fixed effects are not biased by first-order bias terms in the estimates of β. For significance tests, however, this issue cannot be dismissed. Bias of the form in eq. 4.9 makes the likelihood ratio test non-central χ² distributed instead of central χ² distributed, and the associated tests would be biased towards rejection. This would make the actual Type I error rate higher than intended. The remainder of this section details existing methods in the literature and their applicability to testing grades data.

One method for estimating a statistic of unknown distribution is the bootstrap (Efron, 1982), a resampling scheme that can estimate a confidence interval for any statistic largely free of assumptions about the distribution of the error term in equation 4.8. In particular, one might bootstrap the test statistic for the joint hypothesis that a set of fixed effects are all zero (Fox, 2008). There are a few problems with using the bootstrap here. First, the sample must be redrawn to represent the original sampling scheme, but when the data are a census, as in this dissertation, it is not obvious how this can be done. Second, the bootstrap requires refitting the data about one hundred times, compounding the already difficult task of fitting the regressions. Because of these difficulties, I did not use the bootstrap to perform the likelihood ratio (significance) tests.
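For reference, the uncorrected test itself is simple to compute; a sketch assuming two nested fitted model objects that support logLik() (for example, from MASS::polr):

    # Likelihood-ratio test of eq. 4.7 from two nested fitted models.
    lr_test <- function(fit_small, fit_large) {
      lr <- as.numeric(2 * (logLik(fit_large) - logLik(fit_small)))
      k  <- attr(logLik(fit_large), "df") - attr(logLik(fit_small), "df")
      c(LR = lr, df = k, p.value = pchisq(lr, df = k, lower.tail = FALSE))
    }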
Another method for constructing confidence intervals is empirical likelihood (Owen, 2001), which does not use normal theory in developing confidence intervals. However, this method requires solving a number of linear equations that increases with the sample size (Jing et al., 2009) and is therefore infeasible for a large dataset. (The method presented in Jing et al. (2009) need not apply to the question at hand without additional proofs, and so is not as useful for estimating a potentially misspecified model.) There are also analytic methods of removing the bias in equation 4.9 (Hahn & Newey, 2004; Fernandez-Val, 2009), but these authors do not propose a method of testing the joint significance of several parameters.

Finally, the jackknife is a well-known method of removing the bias term from the estimates. There is very little theory for the jackknife for non-linear estimators, but as Wolter (1985) points out, since most non-linear systems behave like linear systems in the neighborhood of the solution, the theorems regarding linear estimators suggest that the jackknife will still be helpful for non-linear estimators. In addition, both of the analytic-methods papers include simulations in which the jackknife appears. In one case the jackknife estimator has lower bias than any other estimator but a higher standard deviation (Fernandez-Val, 2009). In the other, the jackknife estimator always has the lowest bias and standard deviation (Hahn & Newey, 2004). While simulations of this type are only suggestive, the jackknife is certainly not an obviously inferior choice.

Using a block jackknife reduces the number of times that the statistic must be recalculated, and speeds calculation further. One candidate test statistic is analogous to an ANOVA analysis, where the variance-covariance matrix of the estimated values is used to test the joint hypothesis that several variables are simultaneously equal to zero (Duncan, 1978; Matloff, 1980). Constructing that test statistic is problematic in this context because the variance-covariance matrix would be very large and dense, and would therefore require massive amounts of memory to store and use when there are many observations. Instead, the suggestion of Fox (2008), presented in the context of bootstrapping, is to use each resample to generate the test statistic rather than the fitted values, and this idea can also be used for jackknife estimators.

The traditional jackknife estimator, applied to a maximum likelihood estimator of a statistic θ, is

  θ̂_JK = n·θ̂_MLE − ((n − 1)/n) Σ_{i=1}^{n} θ̂_(i)    (4.10)

where the subscript MLE indicates the maximum likelihood estimator, and the subscript (i) indicates θ̂_MLE estimated on the ith jackknife replicate (for a more complete description of the jackknife, see Efron (1982)). In the context of this thesis, the "ith jackknife replicate" is the entire sample with one of the school years (indexed by i) of data removed. Thus one jackknife replicate would remove the 2004-2005 school year and re-estimate the regression as if that year never occurred. The jackknife estimator for the likelihood ratio test is

  LR̂_JK = n·LR̂_MLE − ((n − 1)/n) Σ_{i=1}^{n} LR_(i)    (4.11)

  LR_(i) = 2(ℓ_{l(i)} − ℓ_{s(i)})    (4.12)

where ℓ_{l(i)} and ℓ_{s(i)} are the log-likelihoods of the large and small models fit on the ith jackknife replicate (that is, with the ith school year removed). The issue of how to choose blocks persists since, as with the bootstrap, blocks should be drawn according to the sampling scheme. The most defensible choice is to use an academic year as a block, which also has the desirable property of producing a small number of jackknife replicates.
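A sketch of this block jackknife; refit() is a hypothetical helper standing in for whatever re-estimates the two nested models and returns their log-likelihoods:

    # Jackknifed LR statistic of eqs. 4.11-4.12 with school-year blocks.
    jackknife_lr <- function(data, refit) {
      years <- unique(data$year)
      n     <- length(years)
      full  <- refit(data)                      # list(ll_large =, ll_small =)
      lr    <- 2 * (full$ll_large - full$ll_small)
      lr_i  <- sapply(years, function(y) {      # leave one school year out
        f <- refit(data[data$year != y, ])
        2 * (f$ll_large - f$ll_small)
      })
      n * lr - (n - 1) / n * sum(lr_i)          # eq. 4.11
    }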
4.2.1 Simulation

The theoretical critiques of the likelihood ratio statistic establish that there is a problem with the test, but not the importance or magnitude of the problem. The question remains how poor the p-values will be in actual testing. To answer this, I use a simulation in which data are generated according to

  Y*_ij = ε_ij    (4.13)

where ε_ij is independently and identically normally distributed with mean zero and standard deviation 1, and the observed Y is a discretized version of Y*:

  f(Y*) = F  if Y* ≤ −1
          C  if −1 < Y* ≤ 0
          B  if 0 < Y* ≤ 1
          A  if 1 < Y*    (4.14)

I then perform two hypothesis tests, sharing a null but with different alternatives:

  H0:  Y*_ij = ε_ij
  HA1: Y*_ij = x_i + ε_ij
  HA2: Y*_ij = z_j + ε_ij

Thus, in this simulation, the null hypothesis is true. There are I values of i and J values of j; the parameter I is varied between eight and thirty-two while J is fixed at 500. In the simulation I test the hypothesis that all of the x's (or z's) are jointly zero using the likelihood ratio tests

  LR1 = 2(ℓ_1 − ℓ_0)    (4.15)

  LR2 = 2(ℓ_2 − ℓ_0)    (4.16)

where ℓ_0 is the likelihood when the fit takes the form Y*_ij = ε_ij (4.17), ℓ_1 is the likelihood when the fit takes the form Y*_ij = x_i + ε_ij (4.18), and ℓ_2 is the likelihood when the fit takes the form Y*_ij = z_j + ε_ij (4.19). The cutoffs for the discretization are fit in each model as well. Under the null, these test statistics should be χ² distributed with I or J degrees of freedom, respectively. The tests are run using the likelihood ratio test based on the maximum likelihood estimator (eq. 4.7) and the same statistic jackknifed (eq. 4.11).

Usually, a simulation can be criticized because the specific form of the simulation affects the outcomes. However, in this case the idea is to test the Type I error rate of an estimator, so the simulation generates data where the null hypothesis is true and finds the rate at which the significance test rejects the null hypothesis. There is only one way for the null hypothesis to be true, so the simulation captures this situation completely.

One thousand runs are performed in which data are generated and the tests carried out at a significance level of 0.1. I then tabulate the fraction of the time that the test rejected. Ideally, this would be 10% of the time, or 0.1 in the table. Values larger than this indicate a test that rejects the null hypothesis too often; that is, the test is not sufficiently conservative. A value smaller than this indicates a test that does not reject the null hypothesis often enough; the test is too conservative. Because there are one thousand replicates, if the true value were 0.1, the standard error on the mean would be about 0.01, so values between 0.08 and 0.12 would fall in a 95% confidence interval built around the null hypothesis. Values outside this range can be considered not to agree with the ideal value of 0.1.

The results of the simulation show that the likelihood ratio test behaves as one might suspect: it rejects too often (it is not appropriately protective against Type I error) while I is small, works correctly as I becomes "large," and even becomes too conservative at I = 32.
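A scaled-down sketch of one such run, using MASS::polr for the ordered-logit fit; the sizes and number of replications here are reduced for brevity:

    # Generate null data per eqs. 4.13-4.14, then test the joint hypothesis
    # that all individual (i) effects are zero with the uncorrected LR test.
    library(MASS)
    one_run <- function(I = 8, J = 500) {
      d <- expand.grid(i = factor(1:I), j = factor(1:J))
      d$Y <- cut(rnorm(nrow(d)), c(-Inf, -1, 0, 1, Inf),
                 labels = c("F", "C", "B", "A"), ordered_result = TRUE)
      f0 <- polr(Y ~ 1, data = d)               # eq. 4.17
      f1 <- polr(Y ~ i, data = d)               # eq. 4.18
      lr <- as.numeric(2 * (logLik(f1) - logLik(f0)))
      k  <- attr(logLik(f1), "df") - attr(logLik(f0), "df")
      pchisq(lr, df = k, lower.tail = FALSE) < 0.1
    }
    mean(replicate(200, one_run()))             # rejection rate; ideal: 0.10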
In contrast, the bias-corrected jackknife tests generally reject about 10% of the time (Table 4.4), though this is rarely within the confidence interval, which extends about 0.02 above and below every value in the table (the interval extends 0.02 above and below 0.10 when the actual value is 0.10; elsewhere in the table this width is approximately correct). The jackknife test statistics are close to, but probably not exactly, distributed as described; however, they provide much more robust protection for small values of I.

4.3 Computation

Estimating fixed effects in a non-linear framework poses a computational challenge (Green, 2002; Heckman, 1981). The basic problem is to find the maximum likelihood estimator using the log-likelihood function for the OME

  ℓ(β; Y, X) = Σ_{m,n} Y_mn log(π_mn(β; X))    (4.20)

where m indexes the observations (an observation is a student taking a class and receiving a grade) and n indexes the possible grades; β is the vector of regression coefficients (including the fixed effects); Y_mn is 1 if the student in observation m received the grade indexed by n and is 0 otherwise; X are the regressors (such as the class and student fixed effects); and π(β; X) is the probability of observing a particular grade, given X and β.

The algorithm used to maximize the likelihood is a Newton-method optimizer. The basic algorithm finds roots (zeros) of f(β) by iteratively solving

  β_{k+1} = β_k − [f′(β_k)]⁻¹ f(β_k),    (4.21)

where k is the iteration index, f(β) = ∂ℓ(β)/∂β is the vector of first derivatives of the likelihood with respect to the regression coefficients (also called the first order conditions), and f′(β) = ∂²ℓ(β)/∂β² = H(β) is the Hessian matrix of second derivatives of the likelihood function with respect to the regression coefficients. This equation is easily verified as solving OLS in a single step by plugging in the first and second derivatives f(β) = Xᵀ(y − Xβ) and f′(β) = −XᵀX; a numeric check of that claim follows.
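A minimal, self-contained check on simulated data:

    # One Newton step (eq. 4.21) from an arbitrary start reproduces OLS.
    set.seed(1)
    X <- cbind(1, matrix(rnorm(200), 100, 2))
    y <- drop(X %*% c(1, 2, -1) + rnorm(100))
    f  <- function(b) t(X) %*% (y - X %*% b)   # first derivatives
    fp <- -t(X) %*% X                          # Hessian (constant for OLS)
    b1 <- rep(0, 3) - solve(fp, f(rep(0, 3)))  # one step from beta = 0
    all.equal(drop(b1), unname(coef(lm(y ~ X - 1))))   # TRUE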
In the case of the OME, assuming information equality (Nelder & Wedderburn, 1972), the Hessian matrix is a function of β, and so no such simplification of eq. 4.21 occurs:

  f′(β) = −Xᵀ Ω(β) X,    (4.22)

where Ω(β) is a block-diagonal weighting matrix that varies with β (its exact form appears in McCullagh & Nelder (1989)). The typical method of maximizing this likelihood, essentially due to Fisher (Bliss, 1935), is to calculate the Hessian H(β_k) at every step (Nelder & Wedderburn, 1972). However, this method requires calculating and inverting the Hessian matrix, a task that grows cubically with the number of regressors. For the majority of the problems presented in this dissertation, a sparse matrix package provided with the R programming language makes this tractable. The package improves performance in two ways: it decreases the storage required for a matrix from O(m²) to O(k), where k is the number of non-zero entries, and it uses automated methods to row reduce the Hessian matrix in a near-optimal order. However, some fits are too large for even these methods to handle. Storage of the inverse Hessian is an issue not improved by sparse matrix methods because the inverse is a dense matrix (there is no entry that is necessarily zero). For the largest problems in this dissertation this is not feasible, so an alternative method must be used.

Several methods have been proposed to get around this; the first two below were developed for a single set of fixed effects. Green proposes a method based on the binomial matrix inversion theorem that does not require inversion or storage of a large matrix (Green, 2002); his method cannot be readily extended to multiple sets of fixed effects. Another method, proposed by Heckman, estimates the equation in two steps (Heckman, 1981): in one step the likelihood function is maximized with respect to the fixed effects, and in the other it is maximized with respect to all other parameters. Heckman's method has been successfully extended to a multiple fixed effects problem with linear grades by Arcidiacono et al. (2011). A third alternative is used by Abowd et al., who invert the Hessian matrix using graph theory (Abowd et al., 2002). While there is no general theory of the complexity of this problem, that method is too memory intensive for this application, perhaps because classes and students weave a denser web of interrelationships than workers and firms. (An automated version of this approach is used in the R Matrix package, and it does not sufficiently speed up row reduction or inversion (Bates & Maechler, 2011).)

L-BFGS, a limited-memory modification of BFGS (itself so named because it was separately and simultaneously discovered by four authors: Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970), both avoids constructing the full inverse Hessian and keeps only a local approximation to the Hessian (Liu & Nocedal, 1989; Zhu et al., 1997). It does this by storing only the most recent updates to the Hessian, dropping those older than a pre-determined number of steps. Because results from many prior steps are thrown out, the Hessian approximation is always local to the current best-guess solution, an advantage when the system in question is only locally quadratic. Also, because a rank-two update is the outer product of two vectors, the vectors can be stored instead of the matrix, and the storage requirement is only O(n) instead of the O(n²) required for a full Hessian matrix (see Liu & Nocedal (1989) and Zhu et al. (1997) for a description of how the approximate Hessian is efficiently stored and used).

4.3.1 Convergence

Identifying when an accurate fit has been achieved is difficult: while it is true that the gradient goes to zero at the maximum, finding the exact likelihood maximizer is not possible, and knowing that one is sufficiently close to a peak in the likelihood function is very difficult. Because of this I use two methods to confirm the maximization is complete. Typically, numerical methods are considered to have converged when

  ||β_{k+1} − β_k|| < δ    (4.23)

for some small value of δ. For the fits in this thesis, even extremely small values of δ were found to be insufficient for this to be a valid criterion: refitting the data many times from different starting locations resulted in large differences in β and ℓ(β) between the fits. A different convergence criterion is needed to assure the likelihood function is maximized. Instead, the criterion that the gradient be very small is much more useful. I require

  ||∂ℓ(β_k)/∂β||₁ = Σ_m |∂ℓ(β_k)/∂β_m| < δ    (4.24)

where δ is set to a small value and the subscript k indicates the iteration number of the estimated β. However, the value of ||∂ℓ(β_k)/∂β||₁ does not decrease monotonically, so while I found this criterion works, it does not guarantee convergence. That the criterion was sufficient was checked by solving several of the optimizations from five distinct starting points to verify that the results were nearly identical.
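A sketch of this fit-then-check strategy using R's optim(), which implements L-BFGS-B; the full OME likelihood is replaced here by a small logistic likelihood so the example is self-contained:

    # Fit by L-BFGS, then accept the answer only when the L1 norm of the
    # gradient (eq. 4.24) is small, restarting from the current point if not.
    set.seed(1)
    x <- rnorm(500); y <- rbinom(500, 1, plogis(0.5 + x))
    nll <- function(b) -sum(y * plogis(b[1] + b[2] * x, log.p = TRUE) +
                            (1 - y) * plogis(-(b[1] + b[2] * x), log.p = TRUE))
    gr  <- function(b) { r <- plogis(b[1] + b[2] * x) - y; c(sum(r), sum(r * x)) }
    fit <- optim(c(0, 0), nll, gr, method = "L-BFGS-B")
    while (sum(abs(gr(fit$par))) > 1e-6)      # eq. 4.24 with delta = 1e-6
      fit <- optim(fit$par, nll, gr, method = "L-BFGS-B")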
In every case it was observed that decreasing the value of δ forced the estimated values of β closer together, with a typical mean difference between estimates of 0.01 for a relatively large value of δ of 100. That is,

  (1/n) Σ_i |β_i − β′_i| < 0.01    (4.25)

where β ∈ ℝⁿ, β_i is the ith value of β for one estimate, and β′_i is the ith value of β for an estimate solved from a separate starting location.

4.4 Discussion and Conclusion

When estimating a decomposition of grades into student and class fixed effects, the best estimator is not a priori obvious. The OLS estimator, conditional on its assumptions, should produce accurate estimates. The OME has more palatable assumptions but might not have been an accurate estimator. Despite their shortcomings, simulations showed that the OME and OLS both do a good job of estimating the grades decomposition for cases similar to those in the data considered herein.

One additional issue is that consistency of the likelihood ratio tests assumes that the number of fixed effects does not grow with the number of observations. This is not true in the case of student and class fixed effects, where a first-order bias is induced. The jackknife has the appealing properties of removing first-order bias in estimates and being estimable relatively quickly. A second simulation, on significance testing, shows that when the number of observations per student is large the likelihood ratio tests are not biased towards rejection, and that regardless of the number of classes per student the jackknifed likelihood ratio test is approximately correct, though slightly biased towards rejection. The advantage of the jackknife estimate of the likelihood ratio test is that it provides equal levels of protection for large and small numbers of observations per student, and is thus robust to small numbers of observations per student in a way that the non-jackknifed likelihood ratio test is not.

These models are also difficult to estimate. Sparse matrix techniques did not provide sufficient simplification, but L-BFGS appears able to readily estimate these models when there are a large number of fixed effects for students and classes. I also found that a new criterion must be used for convergence: the total gradient, not the length of the last update, must be small. When this criterion is used, convergence is confirmed much more readily. This class of model is estimable, the correlations can be trusted from the OLS or OME estimators, and likelihood ratio tests can be trusted when using the jackknife (knowing it is consistently somewhat under-conservative) or when there are more than about thirty observations per student.

Table 4.1: Simulation inputs.

  Variable               "Uncor"     "Cor"       "High GPA"
  avg. student GPA       3.02        3.10        3.82
  E(a) − E(d)            2.67        2.67        4.29
  Var(a)                 1.0         1.0         1.0
  Var(d)                 1.0         1.0         1.0
  Cov(a, d)              0           0.7         0
  b_{A/B}                3           3           3
  b_{B/C}                2           2           2
  b_{C/D}                1           1           1
  b_{D/F}                0           0           0
  n_stud                 32,000      32,000      32,000
  classes per student    40          40          40
  students per class     32          32          32
  observations           1,280,000   1,280,000   1,280,000

  Note: a and d are bivariate normally distributed, and the parameters of their distribution are a sufficient statistic for the average student GPA.
Table 4.2: Results of simulations.

                     "Uncor"              "Cor"              "High GPA"
                 True   OME   OLS    True   OME   OLS    True   OME   OLS
  σ(a_i)         1.00   1.13  0.66   1.01   1.04  0.80   1.00   1.96  0.33
  σ(d_j)         0.79   0.81  0.51   0.73   0.74  0.57   0.79   0.84  0.23
  σ(d̄_i)         0.12   0.13  0.08   0.67   0.68  0.52   0.12   0.13  0.04
  τ_G(a, G)      0.51   0.52  0.51   0.24   0.27  0.27   0.59   0.62  0.62
  τ_G(d, G)      0.42   0.42  0.42   0.14   0.14  0.14   0.48   0.48  0.48
  τ_G(a, d)      0.00   0.01  0.01   0.74   0.72  0.72   0.00   0.02  0.02
  ρ(a_i, GPA_i)  0.95   0.91  0.99   0.87   0.93  0.93   0.85   0.65  0.99
  ρ(d̄_i, GPA_i)  0.12   0.12  0.12   0.83   0.83  0.84   0.10   0.10  0.11
  ρ(a_i, d̄_i)    0.00   0.01  0.00   0.99   0.98  0.98   0.00   0.04  0.00

  Note: The top panel shows standard deviations of student ability (a_i), class difficulty (d_j), and students' average class difficulty (d̄_i); the middle panel shows the Goodman-Kruskal correlation coefficients (τ_G) between student ability (a_i), grade (G), and average class difficulty (d̄_i); the bottom panel shows the Pearson correlations between student GPA (GPA_i), student ability, and average class difficulty.

Table 4.3: Results of simulations, 4.0s removed.

                     "Uncor"              "Cor"              "High GPA"
                 True   OME   OLS    True   OME   OLS    True   OME   OLS
  σ(a_i)         0.98   1.02  0.66   1.00   1.04  0.80   0.86   0.87  0.34
  σ(d_j)         0.79   0.81  0.51   0.73   0.74  0.57   0.79   0.84  0.23
  σ(d̄_i)         0.12   0.13  0.08   0.67   0.68  0.52   0.12   0.13  0.04
  ρ(a_i, GPA_i)  0.95   0.96  0.99   0.87   0.93  0.93   0.88   0.92  0.99
  ρ(d̄_i, GPA_i)  0.12   0.12  0.12   0.83   0.83  0.84   0.10   0.09  0.10
  ρ(a_i, d̄_i)    0.00   0.00  0.00   0.99   0.98  0.98   0.03   0.04  0.01

  Note: For a description of the statistics, see Table 4.2.

Table 4.4: Fraction of simulations with p-values smaller than 0.10 for a χ² test for inclusion of fixed effects.

  Individual (i) FEs        LR test                 Jackknife LR test
  I             n = 1,000   2,000   4,000     1,000   2,000   4,000
  8                 0.694   0.625   0.586     0.076   0.072   0.028
  16                0.150   0.101   0.058     0.105   0.083   0.068
  32                0.076   0.021   0.011     0.134   0.095   0.050

  Class (j) FEs             LR test                 Jackknife LR test
  I             n = 1,000   2,000   4,000     1,000   2,000   4,000
  8                 0.728   0.653   0.564     0.126   0.167   0.155
  16                0.242   0.198   0.178     0.175   0.163   0.168
  32                0.075   0.097   0.072     0.142   0.194   0.155

  Note: For any cell, an ideal result is 0.10. Results higher than this indicate too many rejections of the null hypothesis; results lower than this indicate too few. The results are based on a model Y_ij = a_i + d_j + ε_ij analyzed using the ordered logit. In every case the null hypothesis is true, that is, there is no effect of a_i or d_j on Y_ij, so p-values smaller than 0.10 should occur about 10% of the time. The top portion shows results for testing the fixed effects of n individuals when there are I (leftmost column) observations for each individual. The bottom portion shows results for testing 500 class fixed effects (d_j). Both tests show substantial bias for low I, with increases in n only slightly mitigating this effect; once I is as large as 32 there is little bias, and the p-values become small enough that the test is actually conservative. For the jackknife, the test is often slightly positively biased, but the result does not systematically depend on I or n.

Figure 4.1: The relationship between ability, observed ability, and grades. [Figure: three stacked scales showing ability, observed ability, and the awarded letter grade.] Note: Two students with ability a_1 and a_2 (top scale) have their ability observed plus an error term (a_1c = a_1 + ε_1c, a_2c = a_2 + ε_2c; middle scale) and are awarded discrete letter grades based on this observation ('F' and 'A', respectively; bottom scale).

Figure 4.2: Type of variation the estimator allows between classes.
[Figure: two grade scales over observed ability, one for class A and one for class B, each running F, D, C, B, A.] Note: The estimator used allows grades to vary by allowing each class to shift every letter grade up (or down) by the same amount. Here class B is slightly harder and so shifts every grade to the right by the same amount (represented by the arrows).

Figure 4.3: An example of the likelihood of a particular grade given observed ability. [Figure: the density of observed ability for a student with ability a_1, with the grade regions F, D, C, B, A marked and the 'D' region shaded.] Note: The probability that a student with ability a_1 will receive a 'D' grade is proportional to the shaded area.

Figure 4.4: Simulated performance expansion path for the "uncor" simulation. [Figure: GPA versus difficulty.] Note: Shows the "true" simulated results (black plusses), OME (red circles), and OLS (green triangles). Students with perfect 4.0 GPAs are separated into their own group (top left-most point).

Figure 4.5: Simulated performance expansion path for the "cor" simulation. [Figure: GPA versus difficulty.] Note: Shows the "true" simulated results (black plusses), OME (red circles), and OLS (green triangles). Students with perfect 4.0 GPAs are separated into their own group (top right-most point).

Figure 4.6: Simulated performance expansion path for the "High GPA" simulation. [Figure: GPA versus difficulty.] Note: Shows the "true" simulated results (black plusses), OME (red circles), and OLS (green triangles). Students with perfect 4.0 GPAs are separated into their own group (top left-most point).

Appendix A: Data

The data for this thesis were provided by the Office of Institutional Research, Planning and Assessment (OIRPA) as two files: a transcript file and an application/demographics file. On the transcript file, an observation is a single class that a single student took. So, for example, a student who enrolled in 24 classes would have 24 observations in the transcript file, each of which contains the student's identifier, the class identifier, and the resulting grade, as well as several other variables (Table A.1). The file includes all undergraduate transcripts from the University of Maryland from Summer 2003 to Fall 2010. The transcript file is based on data taken from the registrar's office two weeks after the end of each semester, so these are not necessarily the grades that would appear on students' final transcripts, because grades can change after this point. (While I have no method of confirming this, the registrar's office said in an interview that grades do change after this point, but that it is not common.)

The demographic data has the columns shown in Table A.3. There were some duplicate values of newid, but this was a technical problem because the values on the file were always consistent: when duplicate records exist, not every record necessarily contains all of the information for the student (an individual might have columns missing or null on some rows that are not missing on other rows), but when data are present on two rows, they are always identical. Because of this, all of the data from all rows were copied onto a single row and the other rows were removed. The transcript and demographics files can be linked by student using the newid variable.
This is an identifier generated by OIRPA for the purpose of this work, unique to a student and distinct from the social security number and university ID, in order to maintain student anonymity.

The grades are not used exactly as they appear on the file. The letter grades are mapped to grade points per the University of Maryland standard (which does not use the plusses and minuses). Since the OME need only assume that the values are ordinal, the plusses and minuses are included in that specification. In addition, several things can happen besides the award of a letter grade. All possible values of crs grade are shown in Table A.2, along with how they are used.

Construction of the degree, enter, and full samples

The data contain three extracts, "degree", "enter", and "full", which represent different levels of filtering the students. Some criteria were applied to all three datasets. The following describes these criteria and the order in which they were applied, which impacts the final sample. Graduate classes (course number above 500) were removed because they are not typical college classes. Internships (course numbers 386 and 387) were removed for the same reason. Transcript observations with no letter grade were removed (as described above). Classes that did not award two "interior" grades ('B+' or lower and 'C-' or higher) were also excluded because they were essentially uninformative: most of these classes awarded only 'A's or 'F's, and a small number awarded both 'A's and 'F's but no grades between the two. These classes appear to use a version of pass/fail that is inconsistent with the model used for grades. Students who earned fewer than two interior grades were removed from the data because non-interior grades carry very little information, and so these students' fixed effects were not well estimated. Because the information content of their grades is low, the effect of removing them on other students or classes is small or zero. Courses that had a total of five students or fewer over the eight-year sample were also removed because of their unusual nature, as were students with fewer than five courses in the data set. Finally, students who were under about 18 or over about 25 at entrance were removed (these ages are approximate because the age variable is simply an integer with no "as of" date). Students in both of these groups were observed to take substantially different courses than students within this age range; in particular, students at the age extremes tended to be much more focused on a particular department than students between 18 and 25. The number of transcript observations removed by each of these criteria is shown in Table A.4. After applying these filters, an observation is counted as part of the "full" sample.

The "enter" sample is the "full" sample after dropping students who entered after August of 2005 (and thus had fewer than five years to complete their degree) and those who entered in the first observed term (Summer of 2003) with initial credits. In addition, students must have been degree seeking at some point in their career to be in the "enter" sample.

The "degree" sample is the "enter" sample with the additional criterion that the student must appear to have graduated. In particular, the student must have 120 total credits (enrolled plus otherwise appearing on the transcript) or have enrolled for at least 8 terms.
These criteria probably include a somewhat larger set than the students who actually graduated, because a student might enroll in 8 terms or 120 credits and still not complete a degree.

Construction of derived variables

Several derived variables are used from these datasets; the following describes their construction. Each student's GPA is calculated as

  GPA_i = ( Σ_{j ∈ A_i} GP_ij · cr_ij ) / ( Σ_{j ∈ A_i} cr_ij )

where GP_ij is the grade points (from Table A.2) for student i in class j, cr_ij is the number of credits student i was enrolled in for class j, and A_i is the set of all classes in which student i received a non-dropped grade. The set A_i thus excludes classes that did not meet the selection criteria for the "full" sample (see above), even if a valid student enrolled in that class.

The following demographic variables were extracted from the first (chronological) time the student appears in the transcript data: the first term of enrollment, the number of previous credits at the time of enrollment, and the age at entry. From the last term of enrollment, the final major and final term of enrollment are captured and recorded as demographic data. The following demographic variables were extracted using all of a student's transcript data: the number of terms (the count of terms in which the student was enrolled), the number of classes the student was enrolled in, whether the student was ever listed as degree seeking (if any of the ever deg seek values were TRUE), and whether the student was ever enrolled (if any of the ever enrolled values were TRUE).

Class year is calculated each semester using the total number of credits at the beginning of the semester (last cum cr earn ug, written cr in the following equation):

  class year = freshman   if cr < 30
               sophomore  if 30 ≤ cr < 60
               junior     if 60 ≤ cr < 90
               senior     if 90 ≤ cr    (A.1)

For the regressions, total credits is the sum of crs credits on the full dataset by semester. Registered credits is the sum of crs credits where the grading method is "Regular," meaning that the outcome was intended to be a grade. Examples where the outcome is not a grade for a class whose grading method is "Regular" are shown in Table A.2; every value in that table except "audit" is possible when the grading method is "Regular."
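A sketch of these two constructions, assuming a transcript data frame tr with columns newid, gp (grade points from Table A.2), crs_credit, and last_cum_cr_earn_ug; the column names are illustrative renderings of the Table A.1 fields:

    # Credit-weighted GPA by student, and class year per eq. A.1.
    gpa <- with(tr, tapply(gp * crs_credit, newid, sum) /
                    tapply(crs_credit, newid, sum))
    class_year <- cut(tr$last_cum_cr_earn_ug,
                      breaks = c(-Inf, 30, 60, 90, Inf),
                      labels = c("freshman", "sophomore", "junior", "senior"),
                      right = FALSE)        # intervals [0,30), [30,60), ...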
Table A.1: Transcript columns.

  Variable name            Description
  newid                    student ID generated by OIRPA
  term                     numeric term (ex: "200508" for Spring of 2005)
  term transl              a textual term description (ex: "Spring 2005")
  course                   UMD course number (ex: "ENGL101")
  section                  section number
  crs grade                the grade awarded in the course
  crs credit               number of course credits
  matric entry stat ug     matriculation status
  crs grd meth cd          the course grading method code (ex: "R" for regular)
  race citz cd             race / citizen status code
  race citz                text description of race / citizen code
  last cum cr earn ug      college cumulative credits before the term in question
  stu campus code          the student's campus code
  ug gr lev                undergraduate level
  official enrolled ind    indicator for whether the student is officially enrolled
  major                    the first major of the student when taking this class
  student type             an indicator of participation in certain programs
  deg seeking ind          indicator of whether the student is degree seeking (matriculated)
  cls stand prior          unknown
  coll adv                 the college (ex: "BSOS", the college the Economics department is in)
  last cum gpa ug          the cumulative GPA prior to enrollment
  age                      the student's age
  gender cd                the gender of the student

  Note: Some of these variables are not used, such as student type and section.

Table A.2: Grades and how they are used.

  Grade                                              GP value   OME value
  A+                                                 4          A/A+
  A                                                  4          A/A+
  A-                                                 4          A-
  B+                                                 3          B+
  B                                                  3          B
  B-                                                 3          B-
  C+                                                 2          C+
  C                                                  2          C
  C-                                                 2          C-
  D+                                                 1          D
  D                                                  1          D
  D-                                                 1          D
  F                                                  0          F
  withdraw completely                                0          F
  withdraw                                           0          F
  academic dishonesty                                0          F
  pass (when taken pass/fail)                        observation dropped
  satisfactory (when taken satisfactory/fail)        observation dropped
  fail (when taken pass/fail or satisfactory/fail)   observation dropped
  missing                                            observation dropped
  incomplete                                         observation dropped
  audit                                              observation dropped

  Note: "Withdraw completely" means that a student withdrew from all of his or her classes; "withdraw" means that a student withdrew from a single class.

Table A.3: Application / demographic columns.

  Variable name         Description
  newid                 a student ID generated by OIRPA
  zip5                  the zip code from the student's permanent address
  sat high verbal       SAT verbal score
  sat high math         SAT math score
  act high englsih      ACT English score
  act high math         ACT math score
  act high reading      ACT reading score
  act high science      ACT science score
  act high composite    ACT composite score
  sat recentered        whether the SAT score is recentered (all are)
  sat recentered cd     numeric code for previous
  hs acad gpa           high school academic GPA
  weighted gpa ind      unknown
  high school           name of the high school attended
  high school cd        numeric code for previous
  hs class rank pct     high school class rank as a percentage
  transfer gpa          GPA at transfer institution
  last trans inst       name of the last institution
  last trans inst cd    numeric code for previous

  Note: All variables are as of application.

Table A.4: Observations removed by sample selection criteria.

  Criteria                                    n remaining   n removed
  no filters                                  1,840,212     --
  graduate classes                            1,836,364     3,848
  internships                                 1,831,622     4,742
  no grade                                    1,789,063     42,559
  course awarded interior grades              1,776,799     12,264
  student earned interior grades              1,685,971     90,828
  course ever had more than five enrollees    1,685,345     626
  student ever enrolled in five classes       1,675,859     9,486
  age 18 to 25                                1,621,707     54,152
  "full" sample                               1,621,707     --
  "enter" sample                              655,570       966,137
  "degree" sample                             523,151       132,419

  Note: An observation is an individual transcript entry (ex: Sam takes organic chemistry and gets a 'B'). The bottom section shows the number of observations in each of the samples. The n removed column is the difference between the n remaining column on that line and the one above it.

Bibliography

Abowd, J. M., Creecy, R. H., & Kramarz, F. (2002). Computing person and firm effects using linked longitudinal employer-employee data. Available on the authors' website.
Angrist, J. D., Lang, D., & Oreopoulos, P. (2009). Incentives and services for college achievement: Evidence from a randomized trial. American Economic Journal: Applied Economics, 1(1), 136–163.
Arcidiacono, P., Foster, G., Goodpaster, N., & Kinsler, J. (2011). Estimating spillovers using panel data, with an application to the classroom. Available on the authors' website.
Babcock, P. S., & Marks, M. (2010). The falling time cost of college: Evidence from half a century of time use data. Working Paper 15954, National Bureau of Economic Research.
Bates, D., & Maechler, M. (2011). Matrix: Sparse and Dense Matrix Classes and Methods.
Ben-Porath, Y. (1967). The production of human capital and the life cycle of earnings. The Journal of Political Economy, 75, 352–365.
Bettinger, E. P., Evans, B. J., & Pope, D. G. (2011). Improving college performance and retention the easy way: Unpacking the ACT exam.
Working Paper 17119, National Bureau of Economic Research.
Betts, J. R., & Morell, D. (1998). The determinants of undergraduate grade point average. Journal of Human Resources, 34(2), 268–288.
Bliss, C. I. (1935). The calculation of the dosage-mortality curve. Annals of Applied Biology, 22(1), 134–167.
Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms. Journal of the Institute of Mathematics and Its Applications, 6(1), 76–90.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic Studies, 47(1), 225–238.
Cohn, E., Cohn, S., Balch, D. C., & Bradley Jr., J. (2004). Determinants of undergraduate GPAs: SAT scores, high-school GPA and high-school rank. Economics of Education Review, 23, 577–586.
College Board (2009). Total group profile report: Total group. Tech. rep., College Board.
Dale, S. B., & Krueger, A. B. (2002). Estimating the payoff to attending a more selective college: An application of selection on observables and unobservables. The Quarterly Journal of Economics, 117(4), 1491–1527.
DeSimone, J. S. (2008). The impact of employment during school on college student academic performance. Working Paper 14006, National Bureau of Economic Research.
Dorans, N. J. (2002). Recentering and realigning the SAT score distributions: How and why. Journal of Educational Measurement, 39(1), 59–84.
Drew, C. (2011). Why science majors change their mind (it's just so darn hard). The New York Times, November 4.
Duncan, G. T. (1978). An empirical study of jackknife-constructed confidence regions in nonlinear regression. Technometrics, 20(2), 123–129.
Eaton, B. C., & Eswaran, M. (2008). Differential grading standards and student incentives. Canadian Public Policy, 34(2), 215–236.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics.
Fernandez-Val, I. (2009). Fixed effects estimation of structural parameters and marginal effects in panel probit models. Journal of Econometrics, 150, 71–85.
Ferrer-i-Carbonell, A., & Frijters, P. (2004). How important is methodology for the estimates of the determinants of happiness? The Economic Journal, 114(497), 641–659.
Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322.
Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Los Angeles, CA: Sage, 2nd ed.
Freeman, D. G. (1999). Grade divergence as a market outcome. Journal of Economic Education, 30(4), 344–351.
Geiser, S., & Studley, R. (2001). UC and the SAT: Predictive validity and differential impact of the SAT I and SAT II at the University of California. Tech. rep., University of California Office of the President.
Ghent, A. W. (1984). Examination of five tau variants suited to ordered contingency tables, from the viewpoint of biological research. American Midland Naturalist, 112(2), 332–268.
Goldfarb, D. (1970). A family of variable metric updates derived by variational means. Mathematics of Computation, 24(109), 23–26.
Green, W. (2002). The bias of the fixed effects estimator in nonlinear models. Available on the author's website. URL http://pages.stern.nyu.edu/~wgreene/
Griliches, Z. (1977). Estimating the returns to schooling: Some econometric problems. Econometrica, 45(1), 1–22.
Grove, W. A., & Wasserman, T. (2003).
The life-cycle pattern of collegiate GPA: Longitudinal cohort analysis and grade inflation. Journal of Economic Education, 35(2), 162–174.
Hahn, J., & Newey, W. (2004). Jackknife and analytical bias reduction for nonlinear panel models. Econometrica, 72(4), 1295–1319.
Hamilton, B. W. (1975). Zoning and property taxation in a system of local governments. Urban Studies, 12, 205–211.
Hanushek, E. A. (2006). School resources. In E. A. Hanushek, & F. Welch (Eds.) Handbook of the Economics of Education, vol. 2, chap. 14, (pp. 865–908). Elsevier.
Heckman, J. J. (1981). The incidental parameters problem and the problem of initial conditions in estimating a discrete time - discrete data stochastic process. In C. Manski, & D. McFadden (Eds.) Structural Analysis of Discrete Data with Econometric Applications. MIT Press.
Jing, B.-Y., Yuan, J., & Zhou, W. (2009). Jackknife empirical likelihood. Journal of the American Statistical Association, 104(487), 1224–1232.
Jones, E. B., & Jackson, J. D. (1990). College grades and labor market rewards. Journal of Human Resources, 25, 253–266.
Klopfenstein, K., & Thomas, M. K. (2009). The link between Advanced Placement experience and early college success. Southern Economic Journal, 75(3), 873–891.
Leonhardt, D. (2011). Top colleges, largely for the elite. New York Times.
Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45, 503–528.
Loury, L. D., & Garman, D. (1995). College selectivity and earnings. Journal of Labor Economics, 13, 289–308.
Mankiw, G. (2011). A regression I would like to see. Part of Greg Mankiw's blog: Random Observations for Students of Economics. URL http://gregmankiw.blogspot.com/2011/05/regression-i-would-like-to-see.html
Matloff, N. S. (1980). Algorithm AS 148: The jackknife. Journal of the Royal Statistical Society. Series C (Applied Statistics), 29(1), 115–117.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall/CRC.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135(2), 370–384.
Oates, W. E. (2005). The many faces of the Tiebout model. In W. A. Fischel (Ed.) The Tiebout Model at Fifty: Essays in Public Economics in Honor of Wallace Oates, (pp. 21–45). Lincoln Institute of Land Policy.
Owen, A. B. (2001). Empirical Likelihood. No. 92 in Monographs in Statistics and Applied Probability. Chapman & Hall/CRC.
Rothstein, J. M. (2004). College performance prediction and the SAT. Journal of Econometrics, 121(1-2), 297–317.
Sabot, R., & Wakeman-Linn, J. (1991). Grade inflation and course choice. Journal of Economic Perspectives, 5(1), 159–170.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111), 647–656.
Stinebrickner, R., & Stinebrickner, T. R. (2003). Understanding educational outcomes of students from low-income families: Evidence from a liberal arts college with a full tuition subsidy program. Journal of Human Resources, 38(3), 591–617.
U.S. Census Bureau (2001). Census 2000 Summary File 3. Tech. rep., U.S. Census Bureau.
Wainer, H. (1986). Five pitfalls encountered when trying to compare states on their SAT scores. Journal of Educational Measurement, 23(1), 69–81.
Wolter, K. M. (1985). Introduction to Variance Estimation. Springer Series in Statistics. New York: Springer.
Zhu, C., Byrd, R. H., Lu, P., & Nocedal, J. (1997). Algorithm 778:
L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4), 550–560.