Apples and Oranges

Here’s what Director of Schools Dorsey Hopson had to say amid reports that schools in his Shelby County district showed low growth according to recently released state test data:

Hopson acknowledged concerns over how the state compares results from “two very different tests which clearly are apples and oranges,” but he added that the district won’t use that as an excuse.

“Notwithstanding those questions, it’s the system upon which we’re evaluated on and judged,” he said.

State officials stand by TVAAS. They say drops in proficiency rates resulting from a harder test have no impact on the ability of teachers, schools and districts to earn strong TVAAS scores, since all students are experiencing the same change.
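The logic behind that claim is straightforward enough. Here’s a toy sketch of it (not the actual TVAAS model, which is far more elaborate): if a harder test lowers every student’s score by the same amount, a measure built on relative standing doesn’t move.

```python
import numpy as np

# Toy illustration only, not the TVAAS model: if a harder test lowers every
# student's score by the same amount, percentile ranks (and any growth measure
# built on relative standing) are unchanged.
old_scores = np.array([520, 480, 610, 555, 470])
new_scores = old_scores - 60  # a uniform drop from a harder test

ranks_old = old_scores.argsort().argsort() + 1
ranks_new = new_scores.argsort().argsort() + 1

print(ranks_old)  # [3 2 5 4 1]
print(ranks_new)  # identical: relative standing is unchanged
```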

That’s all well and good, except when the system upon which you are evaluated is seriously flawed, it seems there’s an obligation to speak out and fight back.

Two years ago, ahead of what should have been the first year of TNReady, I wrote about the challenges of creating valid TVAAS scores while transitioning to a new test. TNReady was not just a different test; it was (and is) a different type of test from the previous TCAP. For example, it included constructed-response questions rather than just multiple-choice, bubble-in questions.

Here’s what I wrote:

Here’s the problem: There is no statistically valid way to predict expected growth on a new test based on the historic results of TCAP. First, the new test has (supposedly) not been fully designed. Second, the test is in a different format: it is both computer-based and contains constructed-response questions. That is, students must write out answers and/or demonstrate their work.

Since Tennessee has never had a test like this, it’s impossible to predict growth at all. Not even with 10% confidence. Not with any confidence. It is the textbook definition of comparing apples to oranges.

Here’s the statement from Lockwood and McCaffrey (2007), writing in the Journal of Educational Measurement, that I cited to support this claim:

We find that the variation in estimated effects resulting from the different mathematics achievement measures is large relative to variation resulting from choices about model specification, and that the variation within teachers across achievement measures is larger than the variation across teachers.

You get different value-added results depending on the type of test you use. That is, you can’t just say, “this is a new test, but we’ll compare peer groups from the old test and see what happens.” Plus, TNReady presents the added challenge of not having been fully administered last year, so you’re now looking at data from two years ago and extrapolating to this year’s results.

Of course, the company that is paid millions to crunch the TVAAS numbers says that this transition presents no problem at all. Here’s what its technical document has to say about the matter:

In 2015-16, Tennessee implemented new End-of-Course (EOC) assessments in math and English/language arts. Redesigned assessments in Math and English/language arts were also implemented in grades 3-8 during the 2016-17 school year. Changes in testing regimes occur at regular intervals within any state, and these changes need not disrupt the continuity and use of value-added reporting by educators and policymakers. Based on twenty years of experience with providing value-added and growth reporting to Tennessee educators, EVAAS has developed several ways to accommodate changes in testing regimes.

Prior to any value-added analyses with new tests, EVAAS verifies that the test’s scaling properties are suitable for such reporting. In addition to the criteria listed above, EVAAS verifies that the new test is related to the old test to ensure that the comparison from one year to the next is statistically reliable. Perfect correlation is not required, but there should be a strong relationship between the new test and old test. For example, a new Algebra I exam should be correlated to previous math scores in grades seven and eight and to a lesser extent other grades and subjects such as English/language arts and science. Once suitability of any new assessment has been confirmed, it is possible to use both the historical testing data and the new testing data to avoid any breaks or delays in value-added reporting.

A few problems with this. First, there was NO complete administration of a new testing regime in 2015-16. It didn’t happen.

Second, EVAAS doesn’t get paid if there’s no way to generate these “growth scores,” so it is in the company’s interest to find some justification for comparing two very different tests.

Third, researchers who study value-added modeling are highly skeptical of the reliability of comparisons between different types of tests when it comes to generating value-added scores. I noted Lockwood and McCaffrey (2007) above. Here are some more:

John Papay (2011) did a similar study using three different reading tests, with similar results. He stated his conclusion as follows:

[T]he correlations between teacher value-added estimates derived from three separate reading tests — the state test, SRI [Scholastic Reading Inventory], and SAT [Stanford Achievement Test] — range from 0.15 to 0.58 across a wide range of model specifications. Although these correlations are moderately high, these assessments produce substantially different answers about individual teacher performance and do not rank individual teachers consistently. Even using the same test but varying the timing of the baseline and outcome measure introduces a great deal of instability to teacher rankings.

Two points worth noting here: First, different tests yield different value-added scores. Second, even using the same test but varying the timing can create instability in growth measures.

Then there’s data from the Measures of Effective Teaching (MET) Project, which included data from Memphis. On the reliability of value-added estimates across different types of tests, here’s what MET reported:

Once more, the MET study offered corroborating evidence. The correlation between value-added scores based on two different mathematics tests given to the same students the same year was only .38. For 2 different reading tests, the correlation was .22 (the MET Project, 2010, pp. 23, 25).
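To get a feel for what correlations in that range mean in practice, here’s a minimal simulation sketch. It is not the TVAAS/EVAAS methodology; the number of teachers, the noise level, and the target correlation of roughly 0.4 (in the neighborhood of the Papay and MET figures) are illustrative assumptions. The sketch generates two noisy measures of the same underlying teacher effect and checks how often the two measures place a teacher in the same quintile:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_teachers = 500

# One latent "true" teacher effect, observed through two different tests,
# each with its own independent measurement noise.
true_effect = rng.normal(0, 1, n_teachers)
noise_sd = 1.2  # chosen so the two observed scores correlate at roughly 0.4
score_test_a = true_effect + rng.normal(0, noise_sd, n_teachers)
score_test_b = true_effect + rng.normal(0, noise_sd, n_teachers)

corr = np.corrcoef(score_test_a, score_test_b)[0, 1]

# Place teachers into quintiles (1 = bottom fifth, 5 = top fifth) on each test.
quintile_a = np.ceil(5 * (score_test_a.argsort().argsort() + 1) / n_teachers)
quintile_b = np.ceil(5 * (score_test_b.argsort().argsort() + 1) / n_teachers)
same_quintile = (quintile_a == quintile_b).mean()

print(f"correlation between the two value-added measures: {corr:.2f}")
print(f"teachers placed in the same quintile by both tests: {same_quintile:.0%}")
```

In runs like this, only a minority of teachers land in the same quintile on both measures. That is the practical meaning of Papay’s point about inconsistent rankings.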
Despite the claims of EVAAS, the academic research raises significant concerns about extrapolating results from different types of tests. In short, when you move to a different test, you get different value-added results. As I noted in 2015:

If you measure different skills, you get different results. That decreases (or eliminates) the reliability of those results. TNReady is measuring different skills in a different format than TCAP. It’s BOTH a different type of test AND a test on different standards. Any value-added comparison between the two tests is statistically suspect, at best. In the first year, such a comparison is invalid and unreliable. As more years of data become available, it may be possible to make some correlation between past TCAP results and TNReady scores.

Or, if the state is determined to use growth scores (and wants to use them with accuracy), it will wait several years and build completely new growth models based on TNReady alone. At least three years of data would be needed in order to build such a model.

Dorsey Hopson and other Directors of Schools should be pushing back aggressively. Educators should be outraged. After all, this unreliable data will be used as a portion of their teacher evaluations this year. Schools are being rated on a 1-5 scale based on a growth model grounded in suspect methods.

How much is this apple like last year’s orange? How much will this apple ever be like last year’s orange?

If we’re determined to use value-added modeling to measure school-wide growth or district performance, we should at least be determined to do it in a way that ensures valid, reliable results.

For more on education politics and policy in Tennessee, follow @TNEdReport


 

Ready to Waive

Governor Bill Haslam and Commissioner of Education Candice McQueen announced today that in light of difficulties with the administration of the TNReady test, they are proposing that TNReady data NOT be included in this year’s round of teacher evaluations.

The statement comes after the Knox County Board of Education made a similar request by way of resolution in December. That resolution was followed by a statewide call for a waiver by a coalition of education advocacy groups. More recently, principals in Hamilton County weighed in on the issue.

Here’s Governor Haslam’s press release on the waiver:

Tennessee Gov. Bill Haslam today announced he would seek additional flexibility for teachers as the state continues its transition to the TNReady student assessment.

Under the proposal, teachers would have the choice to include or not to include student results from the 2015-2016 TNReady assessment in his or her evaluation score, which typically consists of multiple years of data. The proposal keeps student learning and accountability as factors in an educator’s evaluation while giving teachers the option to include this year’s results if the results benefit them. The governor will work with the General Assembly on specific language and a plan to move the proposal through the legislative process.

“Tennessee students are showing historic progress. The state made adjustments to teacher evaluation and accountability last year to account for the transition to an improved assessment fully aligned with Tennessee standards, which we know has involved a tremendous amount of work on the part of our educators,” Haslam said. “Given recent, unexpected changes in the administration of the new assessment, we want to provide teachers with additional flexibility for this first year’s data.”

Tennessee has led the nation with a teacher evaluation model that has played a vital role in the state’s unprecedented progress in education. Tennessee students are the fastest improving students in the country since 2011. The state’s graduation rate has increased three years in a row, standing at 88 percent. Since 2011, 131,000 more students are on grade-level in math and nearly 60,000 more on grade-level in science. The plan builds upon the Teaching Evaluation Enhancement Act proposed by the governor and approved by the General Assembly last year. This year is the first administration of TNReady, which is fully aligned with the state’s college and career readiness benchmarks.

“Providing teachers with the flexibility to exclude first-year TNReady data from their growth score over the course of this transition will both directly address many concerns we have heard and strengthen our partnership with educators while we move forward with a new assessment,” Department of Education Commissioner Candice McQueen said. “Regardless of the test medium, TNReady will measure skills that the real world will require of our students.”

Most educator evaluations have three main components: qualitative data, which includes principal observations and always counts for at least half of an educator’s evaluation; a student achievement measure that the educator chooses; and a student growth score, which usually comprises 35 percent of the overall evaluation.
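For a rough sense of the arithmetic behind the choice this creates, here’s a sketch assuming the weights described above (50 percent observation, 35 percent growth, with the educator-chosen achievement measure filling the remaining 15 percent). The component scores and the reweighting applied when growth is excluded are illustrative assumptions, not the state’s actual formula:

```python
# Illustrative sketch only; the weights, component scores, and the reweighting
# used when growth is excluded are assumptions, not the state's formula.
observation = 4.0      # qualitative/observation score (at least 50% of the evaluation)
achievement = 3.5      # educator-chosen achievement measure (assumed 15%)
growth_tnready = 2.0   # hypothetical first-year TNReady growth score (35%)

with_growth = 0.50 * observation + 0.15 * achievement + 0.35 * growth_tnready

# If a teacher opts out of the TNReady growth score, that weight has to go
# somewhere; here it is simply redistributed proportionally across the rest.
without_growth = (0.50 * observation + 0.15 * achievement) / 0.65

print(f"composite including TNReady growth: {with_growth:.2f}")
print(f"composite excluding TNReady growth: {without_growth:.2f}")
```

A teacher with a weak first-year growth score comes out ahead by opting out, which is the “if the results benefit them” choice the release describes.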

 

While the release mentions last year’s changes to teacher evaluation to account for TNReady, it fails to note the validity problems created when an evaluation system moves from a multiple-choice test (TCAP) to a constructed-response test (TNReady).

Here’s the Tennessee Education Association on the announcement:

“TEA applauds Gov. Haslam on his proposal to give teachers the flexibility to not use TNReady test data in their 2015-16 evaluations. It is encouraging to see the governor listen to the widespread calls from educators, parents and local school boards for a one-year moratorium for TNReady data in teacher evaluations.”

 

“It is important that schools are given the same leniency as students and teachers during the transition to TNReady. These test scores, which Gov. Haslam is acknowledging are too unreliable for use in teacher evaluations, are the same scores that can place a school on the priority list and make it eligible for state takeover. All high-stakes decisions tied to TNReady test data need to be waived for the 2015-16 school year.”

 

“While the governor’s proposal is a step in the right direction toward decoupling standardized test scores from high-stakes decisions, these measurements have proven to be unreliable statistical estimates that are inappropriate for use in teacher evaluations at all. TEA will continue its push to eliminate all standardized test scores from annual teacher evaluations.”

For more on education politics and policy in Tennessee, follow @TNEdReport