TVAAS Reliability – Tennessee Education Report

Here’s what Director of Schools Dorsey Hopson had to say amid reports that schools in his Shelby County district showed low growth according to recently released state test data:

Hopson acknowledged concerns over how the state compares results from “two very different tests which clearly are apples and oranges,” but he added that the district won’t use that as an excuse.

“Notwithstanding those questions, it’s the system upon which we’re evaluated on and judged,” he said.

State officials stand by TVAAS. They say drops in proficiency rates resulting from a harder test have no impact on the ability of teachers, schools and districts to earn strong TVAAS scores, since all students are experiencing the same change.

That’s all well and good. But when the system upon which you are evaluated is seriously flawed, it seems there’s an obligation to speak out and fight back.

Two years ago, ahead of what should have been the first year of TNReady, I wrote about the challenges of creating valid TVAAS scores while transitioning to a new test. TNReady was not just a different test; it was (is) a different type of test than the previous TCAP test. For example, it included constructed-response questions instead of only multiple-choice, bubble-in questions.

Here’s what I wrote:

Here’s the problem: There is no statistically valid way to predict expected growth on a new test based on the historic results of TCAP. First, the new test has (supposedly) not been fully designed. Second, the test is in a different format. It’s both computer-based and it contains constructed-response questions. That is, students must write-out answers and/or demonstrate their work.

Since Tennessee has never had a test like this, it’s impossible to predict growth at all. Not even with 10% confidence. Not with any confidence. It is the textbook definition of comparing apples to oranges.

Here’s the statement I cited to support this claim, from Lockwood and McCaffrey (2007) in the Journal of Educational Measurement:

We find that the variation in estimated effects resulting from the different mathematics achievement measures is large relative to variation resulting from choices about model specification, and that the variation within teachers across achievement measures is larger than the variation across teachers.

You get different value-added results depending on the type of test you use. That is, you can’t just say this is a new test but we’ll compare peer groups from the old test and see what happens. Plus, TNReady presents the added challenge of not having been fully administered last year, so you’re now looking at data from two years ago and extrapolating to this year’s results.

Of course, the company paid millions to crunch the TVAAS numbers says that this transition presents no problem at all. Here’s what their technical document has to say about the matter:

In 2015-16, Tennessee implemented new End-of-Course (EOC) assessments in math and English/language arts. Redesigned assessments in Math and English/language arts were also implemented in grades 3-8 during the 2016-17 school year. Changes in testing regimes occur at regular intervals within any state, and these changes need not disrupt the continuity and use of value-added reporting by educators and policymakers. Based on twenty years of experience with providing value-added and growth reporting to Tennessee educators, EVAAS has developed several ways to accommodate changes in testing regimes.

Prior to any value-added analyses with new tests, EVAAS verifies that the test’s scaling properties are suitable for such reporting. In addition to the criteria listed above, EVAAS verifies that the new test is related to the old test to ensure that the comparison from one year to the next is statistically reliable. Perfect correlation is not required, but there should be a strong relationship between the new test and old test. For example, a new Algebra I exam should be correlated to previous math scores in grades seven and eight and to a lesser extent other grades and subjects such as English/language arts and science. Once suitability of any new assessment has been confirmed, it is possible to use both the historical testing data and the new testing data to avoid any breaks or delays in value-added reporting.

There are three problems with this. First, there was NO complete administration of a new testing regime in 2015-16. It didn’t happen.

Second, EVAAS doesn’t get paid if there’s not a way to generate these “growth scores,” so it is in its interest to find some justification for comparing the two very different tests.

Third, researchers who study value-added modeling are highly skeptical of the reliability of comparisons between different types of tests when it comes to generating value-added scores. I noted Lockwood and McCaffrey (2007) above. Here are some more:

John Papay (2011) did a similar study using three different reading tests, with similar results. He stated his conclusion as follows: [T]he correlations between teacher value-added estimates derived from three separate reading tests — the state test, SRI [Scholastic Reading Inventory], and SAT [Stanford Achievement Test] — range from 0.15 to 0.58 across a wide range of model specifications. Although these correlations are moderately high, these assessments produce substantially different answers about individual teacher performance and do not rank individual teachers consistently. Even using the same test but varying the timing of the baseline and outcome measure introduces a great deal of instability to teacher rankings.

Two points worth noting here: First, different tests yield different value-added scores. Second, even using the same test but varying the timing can create instability in growth measures.

Then, there’s data from the Measures of Effective Teaching (MET) Project, which included data from Memphis. In terms of reliability when using value-added among different types of tests, here’s what MET reported:

Once more, the MET study offered corroborating evidence. The correlation between value-added scores based on two different mathematics tests given to the same students the same year was only .38. For 2 different reading tests, the correlation was .22 (the MET Project, 2010, pp. 23, 25).
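To get a feel for what correlations like these mean for individual teachers, here is a rough sketch in Python. It is an illustration only, not anything EVAAS, Papay, or the MET researchers ran. It assumes, hypothetically, that value-added scores from two tests follow a standard bivariate normal distribution with the reported correlation, and it simulates made-up teachers rather than using real data. The function names (`shared_variance`, `top_quintile_agreement`) are invented for this example.

```python
import math
import random


def shared_variance(r):
    """Squared correlation: the fraction of variance two measures share."""
    return r * r


def top_quintile_agreement(r, n_teachers=50_000, seed=0):
    """Share of teachers in the top 20% on test A who are also top 20% on test B.

    Assumption (hypothetical): the two sets of value-added scores are
    standard bivariate normal with correlation r. Teachers are simulated.
    """
    rng = random.Random(seed)
    scores_a, scores_b = [], []
    for _ in range(n_teachers):
        a = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Build a second score whose correlation with the first is r.
        b = r * a + math.sqrt(1.0 - r * r) * noise
        scores_a.append(a)
        scores_b.append(b)

    k = n_teachers // 5  # number of teachers in the top quintile
    order_a = sorted(range(n_teachers), key=scores_a.__getitem__, reverse=True)
    order_b = sorted(range(n_teachers), key=scores_b.__getitem__, reverse=True)
    top_a = set(order_a[:k])
    top_b = set(order_b[:k])
    return len(top_a & top_b) / k


if __name__ == "__main__":
    # Correlations quoted above: MET (.38 math, .22 reading), Papay (0.15 to 0.58).
    reported = [("MET math", 0.38), ("MET reading", 0.22),
                ("Papay low", 0.15), ("Papay high", 0.58)]
    for label, r in reported:
        print(f"{label}: r={r:.2f}, "
              f"shared variance={shared_variance(r):.1%}, "
              f"top-quintile agreement={top_quintile_agreement(r):.1%}")
```

The shared-variance arithmetic doesn’t depend on any of the simulation assumptions: a correlation of .38 means the two math tests share about 14% of their variance (.38 squared), and a correlation of .22 means the two reading tests share about 5%. The simulated agreement figures depend entirely on the normality assumption, so treat them as a way to build intuition, not as estimates for Tennessee teachers.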

Despite the claims of EVAAS, the academic research raises significant concerns about extrapolating results from different types of tests. In short, when you move to a different test, you get different value-added results. As I noted in 2015:

If you measure different skills, you get different results. That decreases (or eliminates) the reliability of those results. TNReady is measuring different skills in a different format than TCAP. It’s BOTH a different type of test AND a test on different standards. Any value-added comparison between the two tests is statistically suspect, at best. In the first year, such a comparison is invalid and unreliable. As more years of data become available, it may be possible to make some correlation between past TCAP results and TNReady scores.

Or, if the state is determined to use growth scores (and wants to use them with accuracy), they will wait several years and build completely new growth models based on TNReady alone. At least three years of data would be needed in order to build such a model.

Dorsey Hopson and other Directors of Schools should be pushing back aggressively. Educators should be outraged. After all, this unreliable data will be used as a portion of their teacher evaluations this year. Schools are being rated on a 1-5 scale based on a growth model grounded in suspect methods.

How much is this apple like last year’s orange? How much will this apple ever be like last year’s orange?

If we’re determined to use value-added modeling to measure school-wide growth or district performance, we should at least be determined to do it in a way that ensures valid, reliable results.

For more on education politics and policy in Tennessee, follow @TNEdReport

Candice is Listening

Or, she will be. The Commissioner of Education is going on a statewide tour to talk about testing in light of new flexibility offered to the states under the federal ESSA law, which replaced No Child Left Behind.

From the DOE’s press release:

Commissioner Candice McQueen and senior department leaders are launching a statewide listening tour to gather input from educators, key advocates, parents, students, and the public to determine how to implement specific components of the nation’s new federal education law: the Every Student Succeeds Act (ESSA). The feedback will inform a Tennessee-specific ESSA plan that will guide the department’s work over the coming years and help the state capitalize on the new law’s empowerment of local leadership. These conversations will also build off feedback the commissioner has received on her Classroom Chronicles tour, during which she has met with more than 10,000 Tennessee teachers to learn how policies impact the classroom.

“We need to continue to elevate educators’ ideas to strengthen our education system, and the new federal law provides an opportunity to do that,” said Education Commissioner Candice McQueen. “We look forward to hearing from a variety of educators – from classroom teachers to directors of schools – as well as advocates, parents, and students as we craft a plan for Tennessee to transition to ESSA.”

The release notes that some policy changes might be in order:

Over the summer and fall, department leadership will draft a plan for transitioning to ESSA based on stakeholder and public feedback. Stakeholders and the general public will have another opportunity to provide input on the draft plan later this fall. In spring 2017, the department will work with stakeholder groups, the State Board of Education, and the Tennessee General Assembly as needed to recommend changes to state law and policy, as well as develop further guidance for school districts.

In addition to the various feedback loops and meetings across the state, the department will also be guided by its strategic plan, Tennessee Succeeds, which was developed with input from thousands of stakeholders over the course of several months to establish a clear vision for the future of Tennessee’s schools. It also has established a solid foundation in preparing to transition to ESSA.

Interestingly, the strategic plan referenced includes this under the category of Accountability:

Pilot first grade and career and technical education portfolio models in 2016, and continue to develop additional portfolio options for teachers in non-tested grades and subjects

Develop additional valid and reliable student growth measures for those areas that do not currently have them

Perhaps one improvement that will be suggested is that in addition to developing portfolio models for teacher evaluation (they already exist for related-arts teachers), the state should also provide funding to districts to support their implementation. Few districts use the state’s approved portfolio model for non-tested related-arts teachers, likely because the cost of doing so is not covered by the state. That cost includes both additional staff time and compensation for those performing the portfolio assessments.

The second item of note is: Develop additional valid and reliable student growth measures for those areas that do not currently have them.

This statement assumes that current methods of evaluating student growth (TVAAS) are valid and reliable. To put it simply, they’re not. Additionally, the most common method of assessing student growth is through standardized testing. This raises the possibility that additional tests will be provided for subjects not currently tested. After this year’s TNReady failure, it seems to me we should be exploring other options.

Nevertheless, I’m hopeful that this summer’s listening tour will lead to a new dialogue about Tennessee’s direction in education in light of ESSA. States like Hawaii are already taking student test scores out of the teacher evaluation process and moving toward new measures of evaluation.

Out of the chaos of TNReady, there is opportunity. Educators, parents, and students should attend these summer meetings and share their views on a new path forward for our state’s schools.



Apples and Oranges

Candice is Listening