# Validating the Invalid?

The Tennessee House of Representatives passed legislation today (HB 108) that makes changes to current practice in teacher evaluation as Tennessee transitions to its new testing regime, TNReady.

The changes adjust the percentage of a teacher’s evaluation that is dependent on TVAAS scores to 10% next year, 20% the following year, and back to the current 35% by the 2017-18 academic year.

This plan is designed to allow for a transition period to the new TNReady tests, which will include constructed-response questions and be aligned to the so-called Tennessee standards, which match up with the Common Core State Standards.

Here’s the problem: There is no statistically valid way to predict expected growth on a new test based on the historic results of TCAP. First, the new test has (supposedly) not been fully designed. Second, the test is in a different format: it’s computer-based and it contains constructed-response questions. That is, students must write out answers and/or demonstrate their work.

Since Tennessee has never had a test like this, it’s impossible to predict growth at all. Not even with 10% confidence. Not with any confidence. It is the textbook definition of comparing apples to oranges.

Clearly, legislators feel that, at the very least, this is an improvement: a reasonable accommodation to teachers as our state makes a transition.

But, how is using 10% of an invalid number a good thing? Should any part of a teacher’s evaluation be made up of a number that reveals nothing at all about that teacher’s performance?
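To put a rough number on that question, here is a minimal sketch of how far a teacher’s overall score could move based on the TVAAS component alone. It rests on assumptions that are not in the bill: a simple two-part composite (TVAAS plus everything else), a 1-5 scale, and a teacher scoring 4.0 on everything else. The function names are made up for this illustration, and the actual evaluation formula has more components than this.

```python
# Toy illustration only: a two-part composite on an assumed 1-5 scale.
# The real evaluation formula has more components than this.

# TVAAS weights named in the bill: 10%, then 20%, then back to 35%.
TVAAS_WEIGHTS = (0.10, 0.20, 0.35)


def composite_score(tvaas_score, other_score, tvaas_weight):
    """Weighted average of the TVAAS component and everything else."""
    return tvaas_weight * tvaas_score + (1 - tvaas_weight) * other_score


def composite_swing(tvaas_weight, low=1.0, high=5.0, other_score=4.0):
    """How far the composite moves if only the TVAAS component changes."""
    worst = composite_score(low, other_score, tvaas_weight)
    best = composite_score(high, other_score, tvaas_weight)
    return best - worst


if __name__ == "__main__":
    for weight in TVAAS_WEIGHTS:
        print(f"{weight:.0%} weight: swing of {composite_swing(weight):.2f} points")
```

Under these assumptions the swing is just the weight times the four-point range: 0.4 points at 10%, 0.8 at 20%, and 1.4 at 35%. A smaller weight shrinks the effect of the number, but it does not make the number any more meaningful.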

While value-added data alone is a relatively poor predictor of teacher performance, the value-added estimate used next year is especially poor because it is not at all valid.

But don’t just take my word for it. Researchers studying the validity of value-added measures asked whether value-added gave different results depending on the type of question asked. That question is particularly relevant now because Tennessee is shifting to a new test with different types of questions.

Here’s what Lockwood and McCaffrey (2007) had to say in the Journal of Educational Measurement:

> We find that the variation in estimated effects resulting from the different mathematics achievement measures is large relative to variation resulting from choices about model specification, and that the variation within teachers across achievement measures is larger than the variation across teachers. These results suggest that conclusions about individual teachers’ performance based on value-added models can be sensitive to the ways in which student achievement is measured.

These findings align with similar findings by Martineau (2006) and Schmidt et al. (2005).

You get different results depending on the type of question you’re measuring.

The researchers tested various VAM models (including the type used in TVAAS) and found that teacher effect estimates changed significantly based on both what was being measured AND how it was measured.

And they concluded:

> Our results provide a clear example that caution is needed when interpreting estimated teacher effects because there is the potential for teacher performance to depend on the skills that are measured by the achievement tests.

If you measure different skills, you get different results. That decreases (or eliminates) the reliability of those results. TNReady is measuring different skills in a different format than TCAP. It’s BOTH a different type of test AND a test on different standards. Any value-added comparison between the two tests is statistically suspect, at best. In the first year, such a comparison is invalid and unreliable. As more years of data become available, it may be possible to establish some correlation between past TCAP results and TNReady scores.

Or, if the state is determined to use growth scores (and wants to use them with accuracy), it should wait several years and build completely new growth models based on TNReady alone. At least three years of data would be needed in order to build such a model.

It seems likely that the Senate will follow the House’s lead on Monday and overwhelmingly support the proposed evaluation changes. But in doing so, senators should be asking themselves whether it’s really OK to base any part of a teacher’s evaluation on numbers that reliably predict nothing.
