Methods

How SEDA produced estimates of student performance that are comparable across places, grades, and years.

A simplified description of the SEDA methodology is provided here; for more detail, see the SEDA Technical Documentation.

Background

Federal law requires all states to test all students in grades 3-8 each year in math and ELA (commonly called “accountability” testing). It also requires that states make the aggregated results of those tests public.

All states report test results for schools and school districts to the Department of Education’s National Center for Education Statistics (NCES), which houses these data in the EDFacts database. There are two versions of the data: a public version, available on its website, and a restricted version, available by request. To protect student privacy, the public data are censored: data for small places or small subgroups are not reported. In contrast, the restricted files contain all data, regardless of the number of students within the school/district or subgroup.

Challenges Working with Proficiency Data

While there is a substantial amount of data from every state available in EDFacts, there are four key challenges when using these data:

  1. States provide only “proficiency data”: the count of students at each of the proficiency levels (sometimes called “achievement levels” or “performance levels”). The levels represent different degrees of mastery of the subject-specific grade-level material. Levels are defined by score “thresholds,” which are set by experts in the field. Scoring above or below different thresholds determines placement in a specific proficiency level. Common levels include “below basic,” “basic,” “proficient,” and “advanced.” An example is shown below.

    Test Score   Proficiency Level   Description
    200-500      Below Basic         Inadequate performance; minimal mastery
    501-600      Basic               Marginal performance; partial mastery
    601-700      Proficient          Satisfactory performance; adequate mastery
    701-800      Advanced            Superior performance; complete mastery

    Figure: Scale of test scores relative to grade-level expectations, with the proficiency cutoff at 600.

  2. Most states use their own test and define “proficiency” in different ways, meaning that we cannot directly compare test results in one state to those in another. Proficient in one state/grade/year/subject is not comparable to proficient in another.

    Consider two states that use the same test, which is scored on a scale from 200 to 800 points. Each state sets its own threshold for proficiency at different scores.

    State A: Higher threshold for proficiency

    Figure: Scale of test scores relative to grade-level expectations, with the proficiency cutoff at 600.

    State B: Lower threshold for proficiency

    Figure: Scale of test scores relative to grade-level expectations, with the proficiency cutoff at 500.

    Imagine 500 students take the test. The results are as follows: 50 students score below 400 on the exam; 100 score between 400 and 500; 200 score between 500 and 600; 50 score between 600 and 650; 50 score between 650 and 700; and 50 score above 700. If we use State A’s thresholds for assignment to categories, we find that 150 students are proficient. However, if we use State B’s thresholds, 350 students are proficient.

             Not Proficient        Proficient
    State    Level 1    Level 2    Level 3    Level 4
    A        150        200        100        50
    B        50         100        250        100

    In practice, this means that students in State B may appear to have higher “proficiency” rates than those in State A, even if their true achievement patterns are identical! Using the proficiency data without accounting for differing proficiency thresholds may lead to erroneous conclusions about the relative performance of students in different states.

    This problem is more complicated than the example suggests, because most states use different tests with material of varying difficulty and scores reported on different scales. Therefore, we cannot compare proficiency, nor can we compare students’ test scores between states.

  3. Even within a state, different tests are used in different grade levels. This means that we cannot readily compare the performance of students in 4th grade in one year to that of students in 5th grade in the next year. Therefore, we cannot measure average learning rates across grades.

  4. States may change the tests they use over time. This may result from changes in curricular standards; for example, the introduction of the Common Core State Standards led many states to adopt different tests. These changes make it hard to compare average performance in one year to that of the next. Therefore, we cannot readily measure trends in average performance over time.
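
The two-state example in challenge 2 can be reproduced with a few lines of Python. This is a minimal sketch: the level cutoffs for each state are inferred from the counts in the table (they are not stated explicitly in the example), and the names `score_bins`, `cutoffs`, and `level_counts` are our own illustrative choices.

```python
from bisect import bisect_right

# Score bins from the worked example: (lower bound, upper bound, students).
# The 200 and 800 endpoints are the ends of the example's score scale.
score_bins = [
    (200, 400, 50),
    (400, 500, 100),
    (500, 600, 200),
    (600, 650, 50),
    (650, 700, 50),
    (700, 800, 50),
]

# Assumed level cutoffs, inferred from the counts in the table.
# The middle cutoff is each state's proficiency threshold.
cutoffs = {"A": [500, 600, 700], "B": [400, 500, 650]}


def level_counts(bins, state_cutoffs):
    """Count students per proficiency level; each bin must sit inside one level."""
    counts = [0] * (len(state_cutoffs) + 1)
    for lower, _upper, students in bins:
        # Number of cutoffs at or below the bin's lower bound = level index.
        counts[bisect_right(state_cutoffs, lower)] += students
    return counts


for state, state_cutoffs in cutoffs.items():
    counts = level_counts(score_bins, state_cutoffs)
    proficient = sum(counts[2:])  # Levels 3 and 4
    print(state, counts, "proficient:", proficient)
```

The same 500 students produce 150 proficient students under State A's cutoffs and 350 under State B's, which is the comparability problem described above.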

SEDA methods: Addressing the challenges

While these challenges are substantial, they are not insurmountable. The SEDA team has developed methods to address these challenges in order to produce estimates of students’ average test scores, average learning rates across grades, and trends over time in each school, district, and county in the U.S. All of these estimates are comparable across states, grades, and years.

Below we describe the raw data used to create SEDA and how we:

  1. Estimate the location of each state’s proficiency “thresholds”
  2. Place the proficiency thresholds on the same scale
  3. Estimate the mean test scores in each school, district, and county from the raw data and the threshold estimates
  4. Scale the estimates so they are measured in terms of grade levels
  5. Create estimates of average scores, learning rates, and trends in average scores
  6. Report the data
  7. Ensure data accuracy

Raw data

The data used to develop SEDA come from the EDFacts Restricted-Use Files, provided to our team via a data-use agreement with NCES.

EDFacts provides counts of students scoring in proficiency levels for nearly every school in the U.S. For details on missing data, see the SEDA Technical Documentation.

Data are available for:

  • School years 2008-09 through 2015-16
  • Grades 3-8
  • Math and English Language Arts (ELA)
  • Various subgroups of students: all students, racial/ethnic subgroups, gender subgroups, economic subgroups, etc.

The data also contain NCES identifiers, which allow the information to be linked to other databases or combined into different aggregations (e.g., school districts, counties, etc.).

Below is a mock-up of the data format:

                                                                                    Number of Students Scoring at Proficiency
NCES School ID   State FIPS Code   Subgroup       Subject   Grade   Year (Spring)   Level 1   Level 2   Level 3   Level 4
1                A                 All students   ELA       3       2009            10        50        100       50
2                A                 All students   ELA       3       2009            5         30        80        40
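
As an illustration of how records in this format can be combined into larger aggregations, the sketch below sums the proficiency counts of the two mock schools into a single district. The `district_id` field and its value `"D1"` are hypothetical (the mock-up does not show a district identifier), and the class and function names are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchoolRecord:
    school_id: str
    district_id: str  # hypothetical linking field, not shown in the mock-up
    state: str
    subgroup: str
    subject: str
    grade: int
    year: int
    level_counts: tuple  # students in proficiency levels 1..4


records = [
    SchoolRecord("1", "D1", "A", "All students", "ELA", 3, 2009, (10, 50, 100, 50)),
    SchoolRecord("2", "D1", "A", "All students", "ELA", 3, 2009, (5, 30, 80, 40)),
]


def aggregate_to_district(rows):
    """Sum level counts within each district-subgroup-subject-grade-year cell."""
    totals = {}
    for row in rows:
        key = (row.district_id, row.subgroup, row.subject, row.grade, row.year)
        previous = totals.get(key, (0,) * len(row.level_counts))
        totals[key] = tuple(a + b for a, b in zip(previous, row.level_counts))
    return totals


print(aggregate_to_district(records))
```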

Estimating the location of each state’s proficiency thresholds

We use a statistical technique called heteroskedastic ordered probit (HETOP) modeling to estimate the location of the thresholds that define the proficiency categories within each state, subject, grade, and year. We estimate the model using all the counts of students in each school district within a state-subject-grade-year.

A rough approximation of this method follows. We assume the distribution of test scores in each school district is bell-shaped. For each state, grade, year, and subject, we then find the set of test-score thresholds that meet two conditions: 1) they would most closely produce the reported proportions of students in each proficiency category; and 2) they represent a test-score scale in which the average student in the state-grade-year-subject has a score of 0 and the standard deviation of scores is 1.
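
To make the intuition concrete, the sketch below works through a deliberately simplified version of this idea. It treats the whole state as one bell-shaped distribution with mean 0 and standard deviation 1, rather than fitting a separate distribution for each district as the HETOP model does. Under that assumption, each threshold is the point on the standard normal curve below which the reported cumulative share of students falls. The function names are ours, and the example thresholds (-0.75, 0.05, 1.0) are the ones used in the linking example later on this page.

```python
from itertools import accumulate
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def proportions_from_thresholds(thresholds, mean=0.0, sd=1.0):
    """Share of a bell-shaped score distribution falling in each category."""
    dist = NormalDist(mean, sd)
    edges = [0.0] + [dist.cdf(t) for t in thresholds] + [1.0]
    return [upper - lower for lower, upper in zip(edges, edges[1:])]


def thresholds_from_proportions(proportions):
    """Simplified, single-distribution stand-in for the threshold step."""
    cumulative = list(accumulate(proportions))[:-1]  # drop the final 1.0
    return [standard_normal.inv_cdf(p) for p in cumulative]


# Round trip: start from known thresholds, derive the category shares,
# then recover the thresholds from those shares alone.
statewide_shares = proportions_from_thresholds([-0.75, 0.05, 1.0])
recovered = thresholds_from_proportions(statewide_shares)
print([round(share, 3) for share in statewide_shares])
print([round(t, 2) for t in recovered])
```

The actual model estimates the thresholds jointly with each district's mean and standard deviation, so this should be read as intuition only.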

Example: State A, Grade 4 reading in 2014–15

In the example below, there are three districts in State A. The table shows the number and proportion of scores in each of the state’s four proficiency categories. District 1 has more lower-scoring students than the others; District 3 has more higher-scoring students. We assume each district’s test-score distribution is bell-shaped. We then determine where the three thresholds would be located that would yield the proportions of students in each district shown in the table. In this example, the top threshold is one standard deviation above the statewide average score; at this value, we would expect no students from District 1 to score in the top proficiency category, and would expect 20% of those in District 3 to score in the top category.

Figure: Distribution table and distribution chart for the three districts in State A.

Placing the proficiency thresholds on the same scale

As discussed above, we cannot compare proficiency thresholds across places, grades, and years because states use different tests with completely different scales and set their proficiency thresholds at different levels of mastery. Knowing that a proficiency threshold is one standard deviation above the state average score does not help us compare proficiency thresholds across places, grades, or years because we do not know how a state’s average score in one grade and year compares to that in other states, grades, and years.

Luckily, we can use the National Assessment of Educational Progress (NAEP), a test taken in every state, to place the thresholds on the same scale. This step facilitates comparisons across states, grades, and years.

A random sample of students in every state takes the NAEP assessment in Grades 4 and 8 in math and ELA in odd years (e.g., 2009, 2011, 2013, 2015, and 2017). From NAEP, then, we know the relative performance of states on the NAEP assessment. In the grades and years when NAEP assessments were not administered to students, we average the scores in the grades and years just before and just after to obtain estimates for untested grades, subjects, and years.
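
The averaging step for untested years and grades can be sketched as a simple interpolation. The NAEP values below are hypothetical. Taking the midpoint between two adjacent tested years matches the description above; extending the same idea linearly to Grades 5, 6, and 7 is our assumption about how "just before and just after" applies when the gap is wider than one step.

```python
def interpolate(value_before, value_after, position_before, position_after, position):
    """Linear interpolation; at the midpoint this is the simple average."""
    weight = (position - position_before) / (position_after - position_before)
    return value_before + weight * (value_after - value_before)


# Hypothetical state NAEP means for Grade 4 in two tested years.
grade4_2009, grade4_2011 = 238.0, 240.0
grade4_2010 = interpolate(grade4_2009, grade4_2011, 2009, 2011, 2010)

# Hypothetical state NAEP means for Grades 4 and 8 in the same year.
grade4, grade8 = 240.0, 280.0
grade5 = interpolate(grade4, grade8, 4, 8, 5)

print(grade4_2010, grade5)
```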

We use the states’ NAEP results in each grade, year, and subject to rescale the thresholds to the NAEP scale. For each subject, grade, and year, we multiply the thresholds by the state’s NAEP standard deviation and add the state’s NAEP average score.

Example: State A, Grade 4 reading in 2014–15

The average score and standard deviation of State A NAEP scores in Grade 4 reading in 2014–15 were:

  • Mean NAEP Score: 200
  • Standard Deviation of NAEP Score: 40

We have three thresholds:

  • Threshold 1: -0.75
  • Threshold 2: 0.05
  • Threshold 3: 1.0

As an example, let’s convert Threshold 1 onto the NAEP scale. First, we multiply by 40. Then, we add 200:

(-0.75 x 40.0) + 200 = 170

This yields a new “linked” Threshold 1 of 170. The table below shows all three linked thresholds.

Threshold   Original   Linked (on NAEP Scale)
1           -0.75      170
2           0.05       202
3           1.0        240

We repeat this step for every state in every subject, grade, and year. The end result is a set of thresholds for every state, subject, grade, and year that are all on the same scale, the NAEP scale.
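
The linking arithmetic in the State A example can be written as a one-line function. This is a sketch using the example's values; the function name `link_to_naep` is ours.

```python
def link_to_naep(threshold, naep_mean, naep_sd):
    """Rescale a standardized threshold onto the NAEP scale."""
    return threshold * naep_sd + naep_mean


# State A, Grade 4 reading, 2014-15 (values from the example).
naep_mean, naep_sd = 200.0, 40.0
standardized_thresholds = [-0.75, 0.05, 1.0]

linked_thresholds = [link_to_naep(t, naep_mean, naep_sd) for t in standardized_thresholds]
for original, linked in zip(standardized_thresholds, linked_thresholds):
    print(f"{original:>5} -> {linked:.0f}")
```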

For more information, see Step 3 of the Technical Documentation and Reardon, Kalogrides & Ho (Forthcoming).

Estimating the mean from proficiency count data

The next step of our process is to estimate the mean test score in each school and district for all students and by student subgroups (gender, race/ethnicity, and economic disadvantage). To do this, we estimate heteroskedastic ordered probit models using both the raw proficiency count data (shown above) and the linked thresholds from the prior step. This method allows us to estimate the mean standardized test score in a school or district for every subgroup, subject, grade, and year on the same scale.
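
The sketch below illustrates the idea for a single school: given proficiency counts and thresholds that are already on the NAEP scale, find the bell-shaped distribution (mean and standard deviation) that makes the observed counts most likely. It is a simplified stand-in, not the SEDA estimation code: it fits one school at a time with a brute-force grid search, and it pairs the mock-up counts with the linked thresholds from the State A example purely for illustration, even though those come from different grades and years.

```python
import math
from statistics import NormalDist


def log_likelihood(counts, thresholds, mean, sd):
    """Multinomial log-likelihood of the counts under a normal score distribution."""
    dist = NormalDist(mean, sd)
    edges = [0.0] + [dist.cdf(t) for t in thresholds] + [1.0]
    total = 0.0
    for students, lower, upper in zip(counts, edges, edges[1:]):
        share = max(upper - lower, 1e-12)  # guard against log(0)
        total += students * math.log(share)
    return total


def fit_mean_sd(counts, thresholds, mean_grid, sd_grid):
    """Grid search for the (mean, sd) pair with the highest likelihood."""
    candidates = ((m, s) for m in mean_grid for s in sd_grid)
    return max(candidates, key=lambda pair: log_likelihood(counts, thresholds, *pair))


counts = (10, 50, 100, 50)                 # mock-up school 1
linked_thresholds = (170.0, 202.0, 240.0)  # State A example, NAEP scale
mean_grid = [150 + 0.5 * i for i in range(301)]  # 150 to 300
sd_grid = [10 + 0.5 * i for i in range(101)]     # 10 to 60

estimated_mean, estimated_sd = fit_mean_sd(counts, linked_thresholds, mean_grid, sd_grid)
print(estimated_mean, estimated_sd)
```

A grid search is used here only because it is easy to read; a production implementation would use a proper optimizer and would also report standard errors.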

To get estimates for each county (in a given subject, grade, and year) we take a weighted average of all the estimates from the districts that belong to that county. The weighted average accounts for the number of students served in each district, allowing estimates from large districts in a county to contribute more than small districts.
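
The county calculation is an enrollment-weighted average, sketched below. The district means and enrollments are hypothetical, and the function name is ours.

```python
def county_mean(district_estimates):
    """Enrollment-weighted average of district means: (mean, students) pairs."""
    total_students = sum(students for _mean, students in district_estimates)
    weighted_sum = sum(mean * students for mean, students in district_estimates)
    return weighted_sum / total_students


# Hypothetical county with one large and one small district.
districts = [(250.0, 3000), (230.0, 1000)]
print(county_mean(districts))  # closer to 250 than to 230
```

Because the larger district serves three times as many students, the county estimate sits three-quarters of the way toward its mean.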

For more information, see Steps 5 and 6 in the technical documentation; Reardon, Shear, et al. (2017); and Shear and Reardon (2019).

Scaling the Estimates to Grade Equivalents

On the website, we report all data in grade levels, or what we call the Grade (within Cohort) Standardized (GCS) scale. On this scale, users can interpret one unit as one grade level. The national average performance is 3 in Grade 3, 4 in Grade 4, and so on.

To convert our estimates from the NAEP scale into grade levels, we have to first approximate the average amount student test scores grow in a grade on NAEP. To do this, we use data from three national NAEP cohorts: the cohorts who were in 4th grade in 2009, 2011, and 2013. Below we show the average national NAEP scores in Grades 4 and 8 for these three cohorts. We average over the three cohorts to create a stable baseline, or reference group.

Subject   Grade   2009 Cohort   2011 Cohort   2013 Cohort   Average
Math      4       238.1         239.2         240.4         239.2
Math      8       282.7         280.4         280.9         281.3
Reading   4       217.0         217.8         219.1         218.0
Reading   8       264.8         263.0         264.0         263.9

We calculate the amount the test scores changed between 4th and 8th grade (Average 4th to 8th Grade Growth) as the average score in 8th grade minus the average score in 4th grade. Then, to get an estimate of per-grade growth, we divide that value by 4 (Average Per-Grade Growth).

Subject   Average 4th Grade Score   Average 8th Grade Score   Average 4th to 8th Grade Growth   Average Per-Grade Growth
Math      239.2                     281.3                     42.1                              10.52
Reading   218.0                     263.9                     46.0                              11.49
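
The growth figures can be recomputed from the cohort scores, as in the sketch below. Working from the unrounded cohort averages explains why reading growth is reported as 46.0 even though the rounded averages (263.9 and 218.0) differ by 45.9. The variable and function names are ours.

```python
from statistics import mean

# National NAEP averages for the 2009, 2011, and 2013 cohorts (from the table).
naep_scores = {
    "math": {4: [238.1, 239.2, 240.4], 8: [282.7, 280.4, 280.9]},
    "reading": {4: [217.0, 217.8, 219.1], 8: [264.8, 263.0, 264.0]},
}


def per_grade_growth(scores_by_grade):
    """Average growth per grade between Grades 4 and 8."""
    total_growth = mean(scores_by_grade[8]) - mean(scores_by_grade[4])
    return total_growth / (8 - 4)


growth = {subject: per_grade_growth(s) for subject, s in naep_scores.items()}
for subject, value in growth.items():
    print(f"{subject}: about {value:.3f} NAEP points per grade")
```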

Now, we can use these numbers to rescale the SEDA estimates that are on the NAEP scale into grade equivalents. From the SEDA estimates we subtract the 4th-grade average score, divide by the per-grade growth, and add 4.

Example

A score of 250 in 4th-grade math becomes:

(250 – 239.2)/10.52 + 4 ≈ 5.02 (the displayed averages are rounded; plugging them in directly gives 5.03).

In other words, these students score at a 5th-grade level, or approximately one grade level ahead of the national average (the reference group) in math.

A score of 200 in 3rd-grade reading becomes:

(200 – 218.0)/11.49 + 4 ≈ 2.44 (the displayed averages are rounded; plugging them in directly gives 2.43).

In other words, these students score approximately half a grade level behind the national average for 3rd graders in reading.
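
The conversion can be sketched as a small function. It uses the rounded reference values from the tables above, so its results can differ from the figures in the text by about 0.01. The names `REFERENCE` and `to_grade_equivalent` are ours.

```python
# Rounded reference values from the tables above.
REFERENCE = {
    "math": {"grade4_mean": 239.2, "per_grade_growth": 10.52},
    "reading": {"grade4_mean": 218.0, "per_grade_growth": 11.49},
}


def to_grade_equivalent(naep_scale_score, subject):
    """Convert a NAEP-scale estimate to the grade-equivalent (GCS) scale."""
    ref = REFERENCE[subject]
    return (naep_scale_score - ref["grade4_mean"]) / ref["per_grade_growth"] + 4


print(f"{to_grade_equivalent(250, 'math'):.2f}")     # about one grade above Grade 4
print(f"{to_grade_equivalent(200, 'reading'):.2f}")  # about half a grade below Grade 3
```

By construction, a score equal to the 4th-grade reference average maps to exactly 4.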

The three parameters: Average test scores, learning rates, and trends in test scores

We use hierarchical linear models to produce estimates of average test scores, learning rates, and trends in average test scores. The intuition behind these models is described in this section.

We have measures of the average test scores in up to 48 grade-year cells in each tested subject for each school, district, or county. The scores are adjusted so that a value of 3 corresponds to the average achievement of 3rd graders nationally, a value of 4 corresponds to the average achievement of 4th graders nationally, and so on. For each subject, these can be represented in a table like this:

Hypothetical Average Test Scores (Grade-level Equivalents), By Grade and Year

Grade   2009   2010   2011   2012   2013   2014   2015   2016
8       8.5    8.6    8.7    8.8    8.9    9.0    9.1    9.2
7       7.4    7.5    7.6    7.7    7.8    7.9    8.0    8.1
6       6.3    6.4    6.5    6.6    6.7    6.8    6.9    7.0
5       5.2    5.3    5.4    5.5    5.6    5.7    5.8    5.9
4       4.1    4.2    4.3    4.4    4.5    4.6    4.7    4.8
3       3.0    3.1    3.2    3.3    3.4    3.5    3.6    3.7

In this hypothetical school district, students in 3rd grade in 2009 earned an average score of 3 in this subject, indicating that students scored at a 3rd-grade level, on average (equal to the national average for 3rd graders). Students in 8th grade in 2016 scored at a Grade 9.2 level, on average (1.2 grade levels above the national average for 8th graders).

From this table, we can compute the average test score, the average learning rate, and the average test score trend for the district.

Computing the average test score

To compute the average test score across grades and years, we first use the information in the table to calculate how far above or below the national average students are in each grade and year. This entails subtracting the national grade-level average—e.g., 8 in 8th grade—from the grade-year-specific score.


Hypothetical Average Test Scores (Grade-level Equivalents Relative to National Average), By Grade and Year

Grade   2009   2010   2011   2012   2013   2014   2015   2016
8       0.5    0.6    0.7    0.8    0.9    1.0    1.1    1.2
7       0.4    0.5    0.6    0.7    0.8    0.9    1.0    1.1
6       0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
5       0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
4       0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
3       0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7

In this representation, students in Grade 3 in 2009 have a score of 0, meaning their test scores are equal to the national average for 3rd graders. Students in Grade 8 in 2016 have a score of 1.2, meaning their scores are 1.2 grade levels above the national average for 8th graders.

We then compute the average of these values. In this example, the average difference (the average of the values in the table) is 0.6, meaning that the average grade 3–8 student in the district scores 0.6 grade levels above the national average.

Computing the average learning rate

To compute the average learning rate, we compare students’ average scores in one grade and year to those in the next grade and year (see below). In other words, we look at grade-to-grade improvements in performance within each cohort.

Hypothetical Average Test Scores (Grade-level Equivalents), By Grade and Year

Grade   2009   2010   2011   2012   2013   2014   2015   2016
8       8.5    8.6    8.7    8.8    8.9    9.0    9.1    9.2
7       7.4    7.5    7.6    7.7    7.8    7.9    8.0    8.1
6       6.3    6.4    6.5    6.6    6.7    6.8    6.9    7.0
5       5.2    5.3    5.4    5.5    5.6    5.7    5.8    5.9
4       4.1    4.2    4.3    4.4    4.5    4.6    4.7    4.8
3       3.0    3.1    3.2    3.3    3.4    3.5    3.6    3.7

For example, we compare the average score in Grade 3 in 2009 (3.0) to the average score in Grade 4 in 2010 (4.2). The difference of 1.2 indicates that students’ test scores are 1.2 grade levels higher in 4th grade than they were in 3rd grade, or that students’ learning rate in that year and grade was 1.2. We compute this difference for each diagonal pair of cells in the table, and then take their average. In this table, the average learning rate is also 1.2. If average test scores were at the national average in each grade and year, the average learning rate would be 1.0 (indicating that the average student’s scores improved by one grade level each grade). So, a value of 1.2 indicates that learning rates in this district are 20% faster than the national average.

Computing the trend in average test scores

To compute the average test score trend, we compare students’ average scores in one grade and year to those in the same grade in the next year (see below). In other words, we look at year-to-year improvements in performance within each grade.

Grade   2009   2010   2011   2012   2013   2014   2015   2016
8       8.5    8.6    8.7    8.8    8.9    9.0    9.1    9.2
7       7.4    7.5    7.6    7.7    7.8    7.9    8.0    8.1
6       6.3    6.4    6.5    6.6    6.7    6.8    6.9    7.0
5       5.2    5.3    5.4    5.5    5.6    5.7    5.8    5.9
4       4.1    4.2    4.3    4.4    4.5    4.6    4.7    4.8
3       3.0    3.1    3.2    3.3    3.4    3.5    3.6    3.7

For example, we compare the average score in Grade 3 in 2009 (3.0) to the average score in Grade 3 in 2010 (3.1). The difference of 0.1 indicates that students’ test scores are 0.1 grade levels higher in 3rd grade in 2010 than they were in 3rd grade in 2009. We compute this difference for each horizontal pair of cells in the table, and then take their average. In this example, the average test score trend is 0.1 grade levels per year.
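
The three calculations described in this section can be reproduced directly from the hypothetical table. The sketch below uses simple averages of cell differences to convey the intuition; the published estimates come from hierarchical linear models, which this code does not implement. Variable names are ours.

```python
from statistics import fmean

YEARS = list(range(2009, 2017))

# Hypothetical district scores from the table, keyed by grade;
# each list runs from 2009 through 2016.
scores = {
    3: [3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7],
    4: [4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8],
    5: [5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9],
    6: [6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7.0],
    7: [7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8.0, 8.1],
    8: [8.5, 8.6, 8.7, 8.8, 8.9, 9.0, 9.1, 9.2],
}

# Average test score: mean distance from the national average for each grade.
average_score = fmean(score - grade for grade, row in scores.items() for score in row)

# Learning rate: mean change along each diagonal (same cohort, next grade and year).
learning_rate = fmean(
    scores[grade + 1][i + 1] - scores[grade][i]
    for grade in range(3, 8)
    for i in range(len(YEARS) - 1)
)

# Trend: mean change along each row (same grade, next year).
trend = fmean(
    row[i + 1] - row[i] for row in scores.values() for i in range(len(YEARS) - 1)
)

print(f"{average_score:.1f} {learning_rate:.1f} {trend:.1f}")  # prints "0.6 1.2 0.1"
```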

For technical details, see Step 9 of the technical documentation.

Reporting the data

Suppression of Estimates

We do not report average performance, learning, and/or trend estimates if:

  • Fewer than 95% of students in the school, district or county were tested.
  • More than 40% of students in the school, district or county took alternative assessments.
  • The estimates are too imprecise to be informative.

For more details, see Step 10 of the technical documentation.
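
The suppression rules can be expressed as a simple check, sketched below. The 95% and 40% cutoffs come from the list above. The precision cutoff `max_standard_error` is a hypothetical parameter: this page does not state the value used, so callers must supply their own.

```python
def should_suppress(participation_rate, alternate_assessment_rate,
                    standard_error, max_standard_error):
    """Return True if an estimate should not be reported.

    max_standard_error is a hypothetical precision cutoff; the actual rule
    is described in the technical documentation.
    """
    return (
        participation_rate < 0.95
        or alternate_assessment_rate > 0.40
        or standard_error > max_standard_error
    )


# Hypothetical cases, using an illustrative precision cutoff of 0.5.
print(should_suppress(0.98, 0.05, 0.10, max_standard_error=0.5))  # False
print(should_suppress(0.90, 0.05, 0.10, max_standard_error=0.5))  # True: too few tested
```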

Data accuracy

We have taken a number of steps to ensure the accuracy of the data reported here. The statistical and psychometric methods underlying the data we report are summarized here and published in peer-reviewed journals and in the technical documentation.

First, we conduct a number of statistical analyses to ensure that our methods of converting the raw data into measures of average test scores are accurate. For example, in a small subset of school districts, students take the NAEP test in addition to their state-specific tests. Since the NAEP test is the same across districts, we can use these districts’ NAEP scores to determine the accuracy of our method of converting the state test scores to a common scale. When we do this, we find that our measures are accurate, and generally yield the same conclusions about relative average test scores as we would get if all students took the NAEP test. For more information on these analyses, see the paper “Validation methods for aggregate-level test scale linking: A case study mapping school district test score distributions to a common scale.”

Second, our learning-rate estimates could be inaccurate because they do not account for students moving in and out of schools and districts. For example, if many high-achieving students move out of a school or district in the later grades and/or many low-achieving students move in, the average test scores will appear to grow less from 3rd to 8th grade than they should. This would cause us to underestimate the learning rate in a school or district.

To determine how accurate our learning-rate estimates are, we compared them to the estimated learning rate we would get if we could track individual students’ learning rates over time. Working with research partners who had access to student-level data in three states, we determined that our learning-rate estimates are generally sufficiently accurate to allow comparisons among districts and schools. We did find that our learning-rate estimates tend to be slightly less accurate for charter schools. On average, our estimated learning rates for charter schools tend to overstate the true learning rates in charter schools in these three states by roughly 5%. This is likely because charter schools have more student in- and out-mobility than traditional public schools. It suggests that learning-rate comparisons between charter and traditional public schools should be interpreted with some caution.

Third, we have constructed margins of error (“standard errors”) for each of the measures of average test scores, learning rates, and trends in average scores. These standard errors can be used in statistical analyses and comparisons. Interested users can download data files that include these standard errors from our Get the Data page.
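
One common way to use the standard errors is to check whether the difference between two estimates is larger than sampling noise alone would suggest. The sketch below applies a standard two-sided z-test. It assumes the two estimates are statistically independent, which may not hold for every pair of places, and all of the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def compare_estimates(estimate_a, se_a, estimate_b, se_b):
    """Difference between two estimates, with its standard error, z, and p-value.

    Assumes the two estimates are independent.
    """
    difference = estimate_a - estimate_b
    se_difference = math.sqrt(se_a**2 + se_b**2)
    z = difference / se_difference
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return difference, se_difference, z, p_value


# Hypothetical average test scores (grade levels) and standard errors.
difference, se_difference, z, p_value = compare_estimates(0.6, 0.15, 0.1, 0.20)
print(f"difference={difference:.2f}, se={se_difference:.2f}, z={z:.2f}, p={p_value:.3f}")
```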

Fourth, we do not show any estimates on the website where the margin of error is large. In places where there are a small number of students (or a small number of students of a given subgroup), the margin of error is sometimes large; we do not report data in such cases. Margins of error of school learning rates are also large when there are only two or three grade levels in a school; as a result, roughly one-third of schools are missing learning rates on the website. Note that estimates for all schools—even those with large margins of error—are in the downloadable data files for researchers.

For interested readers and data-file users

For more detail on how we construct the estimated test score parameters using the methods described here, please see the technical documentation. For more detail on the statistical methods we use, as well as information about how accurate the estimates are, please see our technical papers.