Student Test Scores: An Inaccurate Way to Judge Teachers

By Monty Neill, FairTest

Massachusetts is considering how to evaluate teachers in response to the demands of the federal Race to the Top program. On January 25, the state Board of Elementary and Secondary Education heard Gates Foundation staff researcher Thomas Kane present the results of a study backing so-called “value-added measurement” (VAM), a method of using student test scores to rate teachers. By March, the Board is expected to vote on a teacher evaluation plan. At this point, it appears that Kane is the only researcher that Board members will hear before making their decision.

Unfortunately, VAM is a bad tool for making judgments about teachers. Extensive research shows that VAM will intensify the control of testing over curriculum and instruction while judging teachers inaccurately and unfairly. Many good teachers could receive bad rankings, and vice versa.

This is a national issue. U.S. Education Secretary Arne Duncan is pushing VAM hard, and some states have decided that it will count for half of a teacher's evaluation, at least for teachers of tested subjects.

So what is VAM — and why is it dangerous?

VAM is a way to compare test score changes among different teachers' students. The growth in a student's test scores from one year to the next is considered the "value" that has been added to the child's achievement level. As students move from grade 3 to 4 to 5, and so on, their scores can be tracked. Using this data, typically run through complex statistical formulas, the state would try to measure how much "value" a given teacher provides to her students each year. Her students might increase their average scores more or less than other teachers' students, so teachers could be ranked by how much "value" their kids gain. In some states, the rankings have even been published in newspapers.
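
To make the mechanics concrete, here is a minimal sketch of the basic gain-score idea in Python. The teachers, scores, and simple averaging are invented for illustration; no state uses a formula this crude, and actual VAM systems layer elaborate regression adjustments on top of this core idea.

```python
# A deliberately simplified illustration of the "value-added" idea:
# average each teacher's student score gains and rank teachers by that
# average. This is a toy sketch, not any state's actual formula; real
# VAM systems use far more elaborate statistical models.
from statistics import mean

# Hypothetical data: (teacher, student's prior-year score, current-year score)
records = [
    ("Teacher A", 520, 545), ("Teacher A", 480, 500), ("Teacher A", 510, 515),
    ("Teacher B", 530, 540), ("Teacher B", 495, 525), ("Teacher B", 505, 520),
    ("Teacher C", 470, 505), ("Teacher C", 515, 520), ("Teacher C", 490, 495),
]

# "Value added" here is just the mean score gain of a teacher's students.
gains = {}
for teacher, prior, current in records:
    gains.setdefault(teacher, []).append(current - prior)

value_added = {t: mean(g) for t, g in gains.items()}

# Rank teachers from highest to lowest average gain.
for rank, (teacher, va) in enumerate(
        sorted(value_added.items(), key=lambda kv: kv[1], reverse=True), 1):
    print(f"{rank}. {teacher}: average gain {va:+.1f} points")
```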

Standardized tests like MCAS, however, are extremely narrow and inadequate measures of student learning that should not be used to make important decisions such as high school graduation. VAM creates additional distortions and inaccuracies. If used to make important decisions about teachers or principals, it will intensify the already destructive control tests have over teaching and learning. What's more, the gains measured by VAM leave out a whole range of important knowledge and skills, since they are just gains on standardized math and English tests.

On the surface, it seems to make sense to look at student gains rather than one-time scores, since those scores depend so heavily on a student's class, race, disability status, and knowledge of English. And VAM promises to take account of students' backgrounds. But research shows that the statistical techniques do not adequately adjust for different student populations, so teachers are still judged partly on their students' socioeconomic status. The measurement process suffers from many other technical flaws as well. All told, the results are inaccurate.

For example, the researchers Peter Schochet and Hanley Chiang show that, even with three years of student test scores, teachers are rated inaccurately one time out of four. Tim Sass reports that more than two-thirds of the bottom-ranked teachers one year had moved out of the bottom ranks the next year. One third moved from the bottom 20 percent one year to the top 40 percent the next. Only a third who ranked highest one year kept their top ranking the next, and almost a third of the formerly top-ranked teachers landed in the bottom 40 percent in year two. Other studies have found similar instability in “value-added” rankings. Wayne Au explained, “Because of these error rates, a teacher’s performance evaluation may pivot on what amounts to a statistical roll of the dice.”
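
The instability Sass describes is what one would expect when year-to-year noise in measured gains is large relative to any stable teacher effect. The short simulation below illustrates this churn; all parameters (the number of teachers, the sizes of the effect and the noise) are invented for illustration and are not drawn from the cited studies.

```python
# Illustrative simulation: rankings churn when measured "value added" mixes
# a small stable teacher effect with large year-to-year noise. All numbers
# here are invented; this is not the cited studies' data.
import random

random.seed(1)
N = 1000                                              # simulated teachers
true_effect = [random.gauss(0, 1) for _ in range(N)]  # stable teacher "quality"

def yearly_rank(effects, noise_sd=2.0):
    """One year's VAM-style ranking: true effect plus measurement noise."""
    scores = [(e + random.gauss(0, noise_sd), i) for i, e in enumerate(effects)]
    return {i: rank for rank, (_, i) in enumerate(sorted(scores, reverse=True))}

year1 = yearly_rank(true_effect)   # rank 0 = highest measured "value added"
year2 = yearly_rank(true_effect)

bottom1 = [i for i in range(N) if year1[i] >= 0.8 * N]    # bottom 20%, year 1
stayed = sum(1 for i in bottom1 if year2[i] >= 0.8 * N)   # still bottom 20%
to_top40 = sum(1 for i in bottom1 if year2[i] < 0.4 * N)  # top 40%, year 2

print(f"Of {len(bottom1)} bottom-quintile teachers in year 1:")
print(f"  {stayed} remained in the bottom quintile in year 2")
print(f"  {to_top40} jumped into the top 40 percent in year 2")
```

With noise twice as large as the underlying effect, most of a teacher's year-one ranking is luck, and the year-two placement is close to a fresh draw.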

Kane’s research claimed that VAM based on state tests correlated with other measures, including different tests that supposedly measured deeper, conceptual understanding. But the noted economist Jesse Rothstein reported that the actual results of the Gates study do not support that conclusion: the correlation between the two kinds of tests is “only slightly better than coin tosses.” Worse, the study included other measures (such as observations) only if those measures correlated positively with VAM results. That is, the researchers decided ahead of time that VAM was best, then discarded anything that did not correspond with the VAM results.

With VAM, not only would teachers be inaccurately and unfairly judged, they would feel pressured to teach even more intensely to the test. That would further damage and limit our children’s education. Some teachers would be unjustly fired, and many would quit. VAM is unfair in another way as well: unless the state greatly increased the amount of testing, some teachers would be measured by their students’ MCAS scores while others, whose subjects or grades are not tested, would not.

Other professions only rarely use simple indicators to evaluate or pay employees. When they do, the professionals engage in their own version of teaching to the test, often with disastrous results. Wall Street paid its speculators based on simplistic measures, and we are still suffering the consequences.

Teachers deserve high-quality evaluation, for fairness and to help them improve. MCAS is too weak to use for making decisions about students. When all the limitations and errors of VAM are added in, it renders the process “valueless addition” for teachers and their students.

Bibliography

Au, Wayne. 2010-11, Winter. Neither Fair Nor Accurate. Rethinking Schools, pp. 34-38. http://www.rethinkingschools.org/archive/25_02/25_02_au.shtml.

Baker, Eva, et al. 2010, August. Problems with the Use of Student Test Scores to Evaluate Teachers. Economic Policy Institute. http://www.epi.org/publications/entry/bp278. Finds that “If the quality, coverage, and design of standardized tests were to improve, some concerns would be addressed, but the serious problems of attribution and nonrandom assignment of students, as well as the practical problems described above, would still argue for serious limits on the use of test scores for teacher evaluation.”

Corcoran, Sean P. 2010. Can Teachers be Evaluated by their Students’ Test Scores? Should They Be? The Use of Value-Added Measures of Teacher Effectiveness in Policy and Practice. Annenberg Institute for School Reform. http://www.annenberginstitute.org/products/Corcoran.php. Finds that “the promise that value-added systems can provide such a precise, meaningful, and comprehensive picture is not supported by the data” (p. 28).

Di Carlo, Matthew. 2011, January 14. “The biggest flaw in Gates value-added study.” The Answer Sheet. http://voices.washingtonpost.com/answer-sheet/guest-bloggers/the-biggest-flaw-in-the-gates.html.

FairTest. 2009, November. Paying Teachers for Student Test Scores Damages Schools and Undermines Learning. http://www.fairtest.org/paying-for-student-test-scores-damages-schools. Summarizes national and international research findings, in education and elsewhere.

McCaffrey, D., Koretz, D., Lockwood, J.R., and Hamilton, L. 2005. Evaluating Value-Added Models for Teacher Accountability. Santa Monica: RAND Corporation. Concluded that “the research base is currently insufficient to support the use of [value-added methods] for high-stakes decisions about individual teachers or schools.”

National Research Council, Board on Testing and Assessment. 2009. “Letter Report to the U.S. Department of Education on the Race to the Top Fund.” National Academy of Sciences, available at http://www.nap.edu/catalog.php?record_id=12780. Recommended against the requirement in Race to the Top to include student test scores in the evaluation of teachers.

Rothstein, Jesse. 2011, January. Review of “Learning About Teaching.” National Education Policy Center. http://nepc.colorado.edu/thinktank/review-learning-about-teaching.

Sass, Tim R. 2008, November. The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy. National Center for Analysis of Longitudinal Data in Education, Policy Brief 4. http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf. (The full paper on which the brief is based is at http://www.urban.org/UploadedPDF/1001469-calder-working-paper-52.pdf.)

Schochet, Peter Z., and Hanley S. Chiang. 2010, July. Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains. U.S. Department of Education, Institute of Education Sciences. NCEE 2010-4004. http://ies.ed.gov/pubsearch/pubsinfo.asp?pubid=NCEE20104004.
