Comments on a Report by Jay P. Greene, Marcus A. Winters, and Greg Forster

Manhattan Institute Civic Report No. 33 (February 2003)
Testing High Stakes Tests: Can We Believe the Results of Accountability Tests?
Comments by Bas Braams, New York University

The report Testing High Stakes Tests: Can We Believe the Results of Accountability Tests? by Jay Greene, Marcus Winters and Greg Forster of the Manhattan Institute for Policy Research appeared in February, 2003, and received good press. The key assumption in the report is that if the results on a high stakes test and on a respected low stakes test are well correlated then the high stakes test is valid. Greene et al. therefore confuse the predictive power of the high stakes test with its validity as a measure of student learning. This comment is an attempt to clarify the issue and to see what remains of the conclusions of the Greene, Winters and Forster report.

The study's aim and approach are described as follows in the report's Executive Summary.

Do standardized tests that are used to reward or sanction schools for their academic performance, known as "high stakes" tests, effectively measure student proficiency? Opponents of high stakes testing argue that it encourages schools to "teach to the test," thereby improving results on high stakes tests without improving real learning. [...]

This report tackles that important policy issue by comparing schools' results on high stakes tests with their results on other standardized tests that are not used for accountability purposes [...]

The report finds that score levels on high stakes tests closely track score levels on other tests, suggesting that high stakes tests provide reliable information on student performance. When a state's high stakes test scores go up, we should have confidence that this represents real improvements in student learning. If schools are "teaching to the test," they are doing so in a way that conveys useful general knowledge as measured by nationally respected low stakes tests. [...]

The lead author, Jay Greene, also summarized his study in a presentation at the American Enterprise Institute. At that AEI event he introduced his approach in these words.

[...] I have developed a new approach to gauge the validity of high-stakes tests. My study involves comparing student scores on high-stakes exams to the scores of the same students on tests that are well-respected but that have no consequences attached to them and therefore lack an incentive to distort results. If student scores on these "low-stakes" tests correlate with their scores on the high-stakes tests, then we can say with some confidence that the high-stakes assessments are giving us an accurate picture of student achievement.

Before proceeding to look at the study proper let us reflect a bit on that summary and introduction.

The study's aim is certainly appropriate and timely. If a state or a school district administers a high stakes testing program (high stakes for schools, for teachers, for students, or perhaps for all of the above) then the public will want to know that the tests provide a proper measure of student learning and that, within reason, teaching to this test is compatible with good instruction. For this to be true the tests should measure a broad spectrum of learning and should measure students' knowledge and abilities at a level appropriate to the grade level of the tested population.

In order to evaluate a state's or district's high stakes testing program with that in mind, my strong inclination would be to look first at the test content. The study by Greene et al. instead takes a statistical approach as indicated above, comparing the results on the high stakes test with those on some respected national test that is administered under low stakes.

There is an immediate and obvious problem with that approach that has not been noted in the Executive Summary or in the AEI event presentation and that is barely addressed in the study proper. That problem is that if one takes any two tests that both measure something related to intellectual ability or academic achievement then the scores on these two tests will be quite strongly positively correlated. Give students a math test and a history test, or a test of general intelligence and a geography test, and the results will be positively correlated. In the present case a particularly strong correlation between the high stakes and the low stakes tests should be expected a priori, as in each state or district the two tests considered by Greene et al. both aim to measure scholastic achievement in the same subject or subjects.
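A toy simulation makes the point concrete. The sketch below is purely illustrative and uses no real test data. It assumes each student has one general ability factor, and that each test score equals that factor plus independent, equally sized subject-specific variation. The function names and variances are hypothetical choices.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_two_tests(n_students=5000, seed=0):
    """Correlation between two tests that share only a general ability factor."""
    rng = random.Random(seed)
    math_scores, history_scores = [], []
    for _ in range(n_students):
        ability = rng.gauss(0, 1)  # assumed shared factor
        math_scores.append(ability + rng.gauss(0, 1))     # math-specific part
        history_scores.append(ability + rng.gauss(0, 1))  # history-specific part
    return pearson(math_scores, history_scores)


r_two_tests = simulate_two_tests()
print(f"correlation between the two tests: {r_two_tests:.2f}")
```

Under these assumptions the expected correlation is 0.5, even though neither test measures anything about the other subject. Two tests of the same subject would share more than the general factor, so the correlation would be higher still.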

Such a strong positive correlation, when it is found, can be interpreted to say that the results on the high stakes test are a good predictor for the results on the low stakes test. It does not imply that the results on the high stakes test are a good measure of student learning or of the performance of teachers or schools and it does not imply that "teaching to the test" for the high stakes test is compatible with good instruction or is even transferable to the low stakes test. A few examples will clarify this.

Consider an 11th grade situation. Suppose that the low stakes test is a battery of six respected SAT-II subject tests including at least a mathematics subject test, an English composition subject test, a history subject test, and a science subject test. Suppose that the high stakes test is the traditional SAT-I math and verbal aptitude test. We expect a strong correlation between the two, and indeed, we can find relevant correlations in, for example, College Board Research Report No. 2002-6, The Utility of the SAT I and SAT II for Admissions Decisions in California and the Nation. According to Tables 5-7 of that report the correlation between SAT-I verbal and SAT-II Writing and Literature is 0.83, between SAT-I math and SAT-II Math IC and Math IIC is 0.77, and between SAT-I overall and the combination of SAT-II Writing and Math IC is 0.87. These are correlations at the student level; when the data are averaged over classrooms or schools (as are the data in the Greene, Winters and Forster report) the correlations would be stronger still.
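The effect of averaging over schools can be illustrated with a small simulation. This is a sketch under assumed numbers, not a reanalysis of the College Board data. It assumes that schools differ in average achievement (a hypothetical "school effect") and that the two tests have independent student-level noise. Averaging cancels much of that noise but leaves the differences between schools, so the school-level correlation rises. Without differences between schools, averaging would not produce this increase.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def student_vs_school_correlation(n_schools=200, per_school=50, seed=1):
    """Return (student-level r, school-mean-level r) for two noisy tests."""
    rng = random.Random(seed)
    student_a, student_b = [], []
    school_a, school_b = [], []
    for _ in range(n_schools):
        school_effect = rng.gauss(0, 1)  # assumed between-school difference
        a_scores, b_scores = [], []
        for _ in range(per_school):
            ability = school_effect + rng.gauss(0, 1)
            a_scores.append(ability + rng.gauss(0, 1))  # test A noise
            b_scores.append(ability + rng.gauss(0, 1))  # test B noise
        student_a.extend(a_scores)
        student_b.extend(b_scores)
        school_a.append(sum(a_scores) / per_school)
        school_b.append(sum(b_scores) / per_school)
    return pearson(student_a, student_b), pearson(school_a, school_b)


r_student, r_school = student_vs_school_correlation()
print(f"student level: {r_student:.2f}, school level: {r_school:.2f}")
```

With these assumed variances the student-level correlation is roughly two thirds, while the correlation between school averages is close to 1.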

So would the SAT-I be a fine choice as a State's high stakes testing instrument? I think it obvious that it would be a very poor choice. To be sure, the SAT-I is a fine predictor of student performance on the respected subject tests, but it is a poor direct measure of student learning. It is an aptitude test rather than an achievement test. Teaching to the test aimed at the SAT-I would be very impoverished teaching.

In our hypothetical example where the SAT-I serves as the high stakes test and a battery of SAT-II subject tests makes up the low stakes test, would we be able to recognize that "teaching to the test" is taking place by looking at the correlations? I would not think so. The correlations are there in any case. Teaching towards the SAT-I will improve student scores on the SAT-I, but the scores will still be highly correlated with those on the SAT-II, and I wouldn't venture to guess whether the scores would be more or less correlated if teaching to the test is taking place than if it is not.

For a second cautionary example consider a middle school situation. Suppose that the low stakes test is a respected national test in reading and mathematics, set at the appropriate grade level. Suppose that the high stakes test is set at a grade level that is two or three years too low in its demand on specific teachable student skills, but the test has a character that still lets it differentiate certain abilities. For example, the mathematics component may place high demands on the students' verbal abilities. Again we may expect a strong positive correlation between the results of the two tests, and we may expect such a correlation whether or not there is "teaching to the test" for the high stakes one. In this case again, we would not be justified in concluding from our observation of a high correlation that the high stakes test is a good measure of student learning or of the performance of teachers or schools. We can only make that assessment by looking at the test content.

For a third example we consider a hypothetical pair of grade school mathematics tests. The low stakes test has broad coverage of the expected grade appropriate content, whereas the high stakes test is purely a test of fluency in arithmetic at the grade appropriate level. Arithmetic is certainly important for the pupils' performance on the broader low stakes test, and so we expect strong correlations between the results on the two tests. Even so we would find, upon looking at test content, that mathematics teaching directed strictly at the high stakes test is impoverished teaching.

The reader who delves into the Greene, Winters and Forster report will look in vain for such cautionary considerations. The report more or less postulates that strong positive correlation between the high stakes and the low stakes test validates the high stakes test in the sense that "teaching to the test" is compatible with good instruction. Quoting the report's Introduction:

If high stakes tests produce results that are similar to the results of other tests where there are no incentives to manipulate scores, which we might call "low stakes" tests, then we can have confidence that the high stakes do not themselves distort the outcomes.

For further discussion of the validity of high stakes tests I point to the report Toward a Framework for Validating Gains Under High-Stakes Conditions by Daniel M. Koretz, Daniel F. McCaffrey, and Laura S. Hamilton (CRESST CSE Technical Report 551, Dec 2001). (They note with skepticism some earlier claims of Greene concerning the validity of gains on the Florida FCAT.)

At this point we should be very concerned about the study of Greene, Winters, and Forster. The report is going to look at nine jurisdictions (the states Florida and Virginia and seven separate school districts in other states) where there is both a high stakes and a suitable low stakes testing system. Of course the authors will find a strong positive correlation between the two tests in each case. This is entirely expected a priori and no conclusion of genuine interest can be drawn from it. Yet the authors appear to have set themselves up to draw from this observation the conclusion that high stakes tests appear not to distort teaching and learning.

In fact, there is a second angle in the study that, if carefully exploited, could provide some saving grace. The report looks not only at the correlation between the two tests at one point in time, but also at the evolution of test scores over a number of years. That information may provide a way to factor out the expected natural correlation between the performance on the two tests and to see if there may be other influences.

It seems not absurd, for example, to offer the hypothesis that if performance on the high stakes test increases over time relative to performance on the low stakes test then this is evidence that "teaching to the test" is taking place and that the results of this teaching are not transferable. It might be so; on the other hand, maybe the high stakes test is evolving in a way that makes it easier. In any case, if the correlation between the performance on the two tests is changing over time then we have reason to be concerned about the quality of the testing program.

We turn now to the report's specific findings.

In the earlier cited presentation at the AEI Greene describes the year to year gain situation as follows.

In measuring year-to-year progress, or "gain scores," however, the levels of correlation varied across the districts. While Florida had a high correlation between its high-stakes and low-stakes gain scores, some of the other districts surveyed revealed near-zero correlations on gain scores. This result is troubling because it calls into question the ability of high-stakes assessments to give an accurate measurement of progress from year to year.

My assessment of the study outcome at this point is that we are in a deep puddle. The results are promising for Florida but quite negative for each of the remaining eight jurisdictions. Even in the case of Florida caution is in order. Without looking at the content of the two tests it may still be that the situation is as in my grade school or middle school example, where a strong focus on the high stakes test would hardly constitute best practice even though positive results could be visible on both the high stakes and the low stakes test. I might believe that Florida's high stakes testing program is proper and effective, but the crux will be to look at the test content.

The study conclusion offers a more firm and confident perspective.

Florida has incorporated value-added measures into its high stakes testing and accountability system, and the evidence shows that Florida has designed and implemented a high stakes testing system where the year-to-year score gains on the high stakes test correspond very closely with year-to-year score gains on standardized tests where there are no incentives to manipulate the results. This strong correlation suggests that the value-added results produced by Florida's high stakes testing system provide credible information about the influence schools have on student progress.

In all of the other school systems we examined, however, the correlations between score gains on high and low stakes tests are much weaker. We cannot be completely confident that those high stakes tests provide accurate information about school influence over student progress. However, the consistently high correlations we found between score levels on high and low stakes tests does justify a moderate level of confidence in the reliability of those high stakes tests.

I would agree that the study points to Florida as an interesting and promising case, as gains on the FCAT are reflected in gains on the Stanford-9. To understand this better and to assess its importance it would be essential to look closely at the content of both tests. The statement about the other systems, "We cannot be completely confident that ..." is stretching matters to the extreme. The lack of clear correlation between year to year gains in the other eight jurisdictions is suspicious, and if anything it is evidence that in many jurisdictions the high stakes tests are not effective as a guide to teaching and learning. The "consistently high correlations" were expected and are uninformative.
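Why the level correlations are uninformative can be shown with a toy model. It is an illustrative sketch only; every number and name in it is an assumption, and it does not use the report's data. Each school has a stable achievement level plus a year-to-year improvement on the high stakes test. A hypothetical parameter, transfer, sets how much of that improvement also shows up on the low stakes test (1.0 for fully transferable learning, 0.0 for none).

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def level_and_gain_correlation(transfer, n_schools=500, seed=2):
    """Return (r of year-2 levels, r of year-1-to-year-2 gains) across schools."""
    rng = random.Random(seed)
    hs_level, ls_level, hs_gain, ls_gain = [], [], [], []
    for _ in range(n_schools):
        quality = rng.gauss(0, 1)        # stable school achievement level
        improvement = rng.gauss(0, 0.3)  # change on the high stakes test
        hs1 = quality + rng.gauss(0, 0.1)
        ls1 = quality + rng.gauss(0, 0.1)
        hs2 = quality + improvement + rng.gauss(0, 0.1)
        ls2 = quality + transfer * improvement + rng.gauss(0, 0.1)
        hs_level.append(hs2)
        ls_level.append(ls2)
        hs_gain.append(hs2 - hs1)
        ls_gain.append(ls2 - ls1)
    return pearson(hs_level, ls_level), pearson(hs_gain, ls_gain)


level_full, gain_full = level_and_gain_correlation(transfer=1.0)
level_none, gain_none = level_and_gain_correlation(transfer=0.0)
print(f"transferable gains:     levels {level_full:.2f}, gains {gain_full:.2f}")
print(f"non-transferable gains: levels {level_none:.2f}, gains {gain_none:.2f}")
```

In this model the level correlation is above 0.9 in both scenarios, so it cannot tell them apart. Only the gain correlation separates the case where improvement transfers from the case where it does not.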

Even the very subdued caution of the report's conclusions is thrown to the wind in many of the press reports around the release of the study. For example, on Feb 12, 2003, study authors Greene and Winters wrote in the San Francisco Chronicle:

We found very strong correlations between the results of high-stakes tests and low-stakes tests in nearly every jurisdiction that we studied. Nationally, 77 percent of the variation in high-stakes test results examined could be explained by a school's performance on a low-stakes test. [...] Our findings suggest that if teachers facing high-stakes tests are focusing exclusively on material found on those tests, as some claim, then by doing so they are teaching skills that are generally useful rather than useful only on a single standardized test.

The report does stress, in the end: "Further research is needed to identify ways in which other school systems might modify their practices to produce results more like those in Florida." I agree that it is desirable to understand how the Florida case is different from the other eight cases.

With a view to that further research it is of interest to look at the study's brief summary of other approaches. In his AEI presentation Greene writes:

Previous research on this question has taken on four different forms, all of which I believe are somewhat problematic. One variety is anecdotal, where researchers interview teachers and administrators and attempt to gauge how their behavior has changed since the introduction of a high-stakes test or document examples of cheating or other manipulations of the test. [...] Other studies have compared the results of high-stakes tests to the results of college-oriented tests like the Scholastic Aptitude Test, ACT, and Advanced Placement tests, and found little correlation between the two. [...] Another approach is to compare classroom grades to high-stakes test scores. [...] The last approach commonly used is a comparison of high-stakes scores to National Assessment of Educational Progress scores [...] but the NAEP figures only give one number for a whole state, making it impossible to compare the numbers school by school or student by student.

Four different forms, all problematical, and then the form used in the study by Greene et al., also problematical. As a conclusion to this review I suggest that the research community is overlooking a very obvious approach that I would consider mandatory in conjunction with other approaches: Just look at the tests. For starters I offer my Web reference, Assessment Reviews - What Little I know.

Bas Braams
March 23, 2003 (rev. April 5)

[Addendum. For another perspective on the Greene, Winters, and Forster paper see Wobegon Republicans and Test Score Inflation, by Anonymous. That author argues that education officials have every bit as much motivation and plenty more opportunity to manipulate low-stakes tests as they do to manipulate high-stakes tests.]


The views and opinions expressed in this page are strictly those of the page author. The contents of this page have not been reviewed or approved by New York University.