Apples and Oranges

Here’s what Director of Schools Dorsey Hopson had to say amid reports that schools in his Shelby County district showed low growth according to recently released state test data:

Hopson acknowledged concerns over how the state compares results from “two very different tests which clearly are apples and oranges,” but he added that the district won’t use that as an excuse.

“Notwithstanding those questions, it’s the system upon which we’re evaluated on and judged,” he said.

State officials stand by TVAAS. They say drops in proficiency rates resulting from a harder test have no impact on the ability of teachers, schools and districts to earn strong TVAAS scores, since all students are experiencing the same change.

That’s all well and good. But when the system by which you are evaluated is seriously flawed, there’s an obligation to speak out and fight back.

Two years ago, ahead of what should have been the first year of TNReady, I wrote about the challenges of creating valid TVAAS scores while transitioning to a new test. TNReady was not just a different test; it was (and is) a different type of test than the previous TCAP test. For example, it included constructed response questions instead of simply multiple choice bubble-in questions.

Here’s what I wrote:

Here’s the problem: There is no statistically valid way to predict expected growth on a new test based on the historic results of TCAP. First, the new test has (supposedly) not been fully designed. Second, the test is in a different format. It’s both computer-based and it contains constructed-response questions. That is, students must write-out answers and/or demonstrate their work.

Since Tennessee has never had a test like this, it’s impossible to predict growth at all. Not even with 10% confidence. Not with any confidence. It is the textbook definition of comparing apples to oranges.

To support this claim, I cited Lockwood and McCaffrey (2007), writing in the Journal of Educational Measurement:

We find that the variation in estimated effects resulting from the different mathematics achievement measures is large relative to variation resulting from choices about model specification, and that the variation within teachers across achievement measures is larger than the variation across teachers.

You get different value-added results depending on the type of test you use. That is, you can’t just say this is a new test but we’ll compare peer groups from the old test and see what happens. Plus, TNReady presents the added challenge of not having been fully administered last year, so you’re now looking at data from two years ago and extrapolating to this year’s results.

Of course, the company paid millions to crunch the TVAAS numbers says that this transition presents no problem at all. Here’s what their technical document has to say about the matter:

In 2015-16, Tennessee implemented new End-of-Course (EOC) assessments in math and English/language arts. Redesigned assessments in Math and English/language arts were also implemented in grades 3-8 during the 2016-17 school year. Changes in testing regimes occur at regular intervals within any state, and these changes need not disrupt the continuity and use of value-added reporting by educators and policymakers. Based on twenty years of experience with providing value-added and growth reporting to Tennessee educators, EVAAS has developed several ways to accommodate changes in testing regimes.

Prior to any value-added analyses with new tests, EVAAS verifies that the test’s scaling properties are suitable for such reporting. In addition to the criteria listed above, EVAAS verifies that the new test is related to the old test to ensure that the comparison from one year to the next is statistically reliable. Perfect correlation is not required, but there should be a strong relationship between the new test and old test. For example, a new Algebra I exam should be correlated to previous math scores in grades seven and eight and to a lesser extent other grades and subjects such as English/language arts and science. Once suitability of any new assessment has been confirmed, it is possible to use both the historical testing data and the new testing data to avoid any breaks or delays in value-added reporting.

There are a few problems with this. First, there was NO complete administration of a new testing regime in 2015-16. It didn’t happen.

Second, EVAAS doesn’t get paid if there’s not a way to generate these “growth scores,” so it is in their interest to find some justification for comparing the two very different tests.

Third, researchers who study value-added modeling are highly skeptical of the reliability of comparisons between different types of tests when it comes to generating value-added scores. I noted Lockwood and McCaffrey (2007) above. Here are some more:

John Papay (2011) did a similar study using three different reading tests, with similar results. He stated his conclusion as follows:

[T]he correlations between teacher value-added estimates derived from three separate reading tests — the state test, SRI [Scholastic Reading Inventory], and SAT [Stanford Achievement Test] — range from 0.15 to 0.58 across a wide range of model specifications. Although these correlations are moderately high, these assessments produce substantially different answers about individual teacher performance and do not rank individual teachers consistently. Even using the same test but varying the timing of the baseline and outcome measure introduces a great deal of instability to teacher rankings.

Two points worth noting here: First, different tests yield different value-added scores. Second, even using the same test but varying the timing can create instability in growth measures.

Then, there’s data from the Measures of Effective Teaching (MET) Project, which included data from Memphis. In terms of reliability when using value-added among different types of tests, here’s what MET reported:

Once more, the MET study offered corroborating evidence. The correlation between value-added scores based on two different mathematics tests given to the same students the same year was only .38. For 2 different reading tests, the correlation was .22 (the MET Project, 2010, pp. 23, 25).

Despite the claims of EVAAS, the academic research raises significant concerns about extrapolating results from different types of tests. In short, when you move to a different test, you get different value-added results. As I noted in 2015:

If you measure different skills, you get different results. That decreases (or eliminates) the reliability of those results. TNReady is measuring different skills in a different format than TCAP. It’s BOTH a different type of test AND a test on different standards. Any value-added comparison between the two tests is statistically suspect, at best. In the first year, such a comparison is invalid and unreliable. As more years of data become available, it may be possible to make some correlation between past TCAP results and TNReady scores.

Or, if the state is determined to use growth scores (and wants to use them with accuracy), they will wait several years and build completely new growth models based on TNReady alone. At least three years of data would be needed in order to build such a model.
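
To make the correlations cited above more concrete, here is a minimal simulation sketch. It is not TVAAS, EVAAS, or the MET methodology. It simply assumes (my assumption, for illustration only) that each teacher has two normally distributed value-added scores, one per test, correlated at a chosen level, and then asks how many teachers who land in the top fifth on one test also land in the top fifth on the other. The function name and the sample size are hypothetical.

```python
import math
import random


def top_quintile_overlap(rho, n_teachers=20000, seed=0):
    """Share of top-quintile teachers on test A who are also top-quintile on test B.

    Assumption: scores on both tests are standard normal with correlation rho.
    """
    rng = random.Random(seed)
    scores_a, scores_b = [], []
    for _ in range(n_teachers):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        scores_a.append(z1)
        # Mixing z1 with independent noise gives a score correlated with z1 at rho.
        scores_b.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)

    k = n_teachers // 5  # size of the top quintile
    order_a = sorted(range(n_teachers), key=scores_a.__getitem__, reverse=True)
    order_b = sorted(range(n_teachers), key=scores_b.__getitem__, reverse=True)
    top_a = set(order_a[:k])
    top_b = set(order_b[:k])
    return len(top_a & top_b) / k


if __name__ == "__main__":
    # 0.38 and 0.22 are the MET correlations quoted above; 0.9 is a contrast case.
    for rho in (0.22, 0.38, 0.9):
        share = top_quintile_overlap(rho)
        print(f"correlation {rho:.2f}: {share:.0%} of top-quintile teachers stay top-quintile")
```

Under these assumptions, at the correlations MET reported, well under half of the teachers ranked in the top fifth on one test stay in the top fifth on the other. That is the practical meaning of "do not rank individual teachers consistently."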

Dorsey Hopson and other Directors of Schools should be pushing back aggressively. Educators should be outraged. After all, this unreliable data will be used as a portion of their teacher evaluations this year. Schools are being rated on a 1-5 scale based on a growth model grounded in suspect methods.

How much is this apple like last year’s orange? How much will this apple ever be like last year’s orange?

If we’re determined to use value-added modeling to measure school-wide growth or district performance, we should at least be determined to do it in a way that ensures valid, reliable results.

For more on education politics and policy in Tennessee, follow @TNEdReport


Mike Stein on the Teachers’ Bill of Rights

Coffee County teacher Mike Stein offers his thoughts on the Teachers’ Bill of Rights (SB14/HB1074) being sponsored at the General Assembly by Mark Green of Clarksville and Jay Reedy of Erin.

Here’s some of what he has to say:

In my view, the most impactful elements of the Teachers’ Bill of Rights are the last four items. Teachers have been saying for decades that we shouldn’t be expected to purchase our own school supplies. No other profession does that. Additionally, it makes much-needed changes to the evaluation system. It is difficult, if not impossible, to argue against the notion that we should be evaluated by other educators with the same expertise. While good teaching is good teaching, there are content-specific strategies that only experts in that subject would truly be able to appreciate fully. Both the Coffee County Education Association and the Tennessee Education Association support this bill.

And here are those four items he references:

This bill further provides that an educator is not: (1) Required to spend the educator’s personal money to appropriately equip a classroom; (2) Evaluated by professionals, under the teacher evaluation advisory committee, without the same subject matter expertise as the educator; (3) Evaluated based on the performance of students whom the educator has never taught; or (4) Relocated to a different school based solely on test scores from state mandated assessments.

The legislation would change the teacher evaluation system by effectively eliminating TVAAS scores from the evaluations of teachers in non-tested subjects — those scores may be replaced by portfolios, an idea the state has rolled out but not funded. Additionally, identifying subject matter specific evaluators could prove difficult, but would likely provide stronger, more relevant evaluations.

Currently, teachers aren’t required to spend their own money on classrooms, but many teachers do because schools too often lack the resources to meet the needs of students. It’s good to see Senator Green and Rep. Reedy drawing attention to the important issue of classroom resources.



Knox County Takes a Stand

Last night, the Knox County School Board voted 6-3 in favor of a resolution calling on the General Assembly and State Board of Education to waive the use of TCAP/TNReady data in student grades and teacher evaluations this year.

The move comes as the state prepares to administer the tests this year with a new vendor following last year’s TNReady disaster. The lack of a complete testing cycle last year plus the addition of a new vendor means this year is the first year of the new test.

The Board passed the resolution in spite of Governor Haslam warning against taking such a step.

In his warning, Haslam said:

“The results we’ve seen are not by accident in Tennessee, and I think you have to be really careful about doing anything that could cause that to back up,” Haslam said.

He added:

Haslam attributed that progress to three things, including tying standardized tests to teacher evaluations.

“It’s about raising our standards and expectations, it’s about having year-end assessments that match those standards and then I think it’s about having assessments that are part of teachers’ evaluations,” Haslam said. “I think that you have to have all of those for a recipe for success.”

Haslam can present no evidence for his claim about the use of student assessment in teacher evaluation. In fact, it’s worth noting that prior to 2008, Tennessee students achieved at a high level according to what were then the state standards. While the standards themselves were determined to need improvement, the point is teachers were helping students hit the designated mark.

Teachers were moving students forward at this time without evaluations tied to student test results. Policymakers set a mark for student performance, teachers worked to hit that mark and succeeded. Standards were raised in 2008, and since then, Tennessee has seen detectable growth in overall results, including some exciting news when NAEP results were released.

To suggest that a year without the use of TVAAS scores in teacher evaluations will cause a setback is to insult Tennessee’s teachers. As if they’ll just relax and not teach as hard.

Another argument raised against the resolution is that it will somehow absolve teachers and students of accountability.

Joe Sullivan reports in the Knoxville Mercury:

In an email to board members, [Interim Director of Schools Buzz] Thomas asserted that, “We need a good standardized test each year to tell us how we are doing compared to others across the state and the nation. We will achieve greatness not by shying away from this accountability but by embracing it.” And he fretted that, “This resolution puts that at risk. In short, it will divide us. Once again we could find ourselves in two disputing camps. The pro-achievement folks on the one side and the pro-teacher folks on the other.”

Right now, we don’t know if we have a good standardized test. Taking a year to get it right is important, especially in light of the frustrations of last year’s TNReady experience.

Of course, there’s no need for pro-achievement and pro-teacher folks to be divided into two camps, either. Tennessee can have a good, solid test that is an accurate measure of student achievement and also treat teachers fairly in the evaluation process.

To be clear, teachers aren’t asking for a waiver from all evaluation. They are asking for a fair, transparent evaluation system. TVAAS has long been criticized as neither. Even under the best of circumstances, TVAAS provides a minimal level of useful information about teacher performance.

Now, we’re shifting to a new test. That shift alone makes it impossible to achieve a valid value-added score. In fact, researchers in the Journal of Educational Measurement have said:

We find that the variation in estimated effects resulting from the different mathematics achievement measures is large relative to variation resulting from choices about model specification, and that the variation within teachers across achievement measures is larger than the variation across teachers. These results suggest that conclusions about individual teachers’ performance based on value-added models can be sensitive to the ways in which student achievement is measured.

These findings align with similar findings by Martineau (2006) and Schmidt et al. (2005).

You get different results depending on the type of question you’re measuring.

The researchers tested various VAM models (including the type used in TVAAS) and found that teacher effect estimates changed significantly based on both what was being measured AND how it was measured.

Changing to a new type of test creates value-added uncertainty. That means scores attributed to teachers based on a comparison of this year’s tests and the old tests will not be valid.

While insisting that districts use TVAAS in teacher evaluations this year, the state is also admitting it’s not quite sure how that will work.

From Sullivan’s story:

When asked how these determinations will be made, a spokesperson for the state Department of Education acknowledges that a different methodology will have to be employed and says that, “we are still working with various statisticians and experts to determine the exact methodology we will use this year.”

Why not take at least a year, be sure there’s a test that works, and then build a model based on that? What harm would come from giving teachers and students a year with a test that’s just a test? Moreover, the best education researchers have already warned that testing transitions create value-added bumps. Why not avoid the bumps and work to create an evaluation system that is fair and transparent?

Knox County has taken a stand. We’ll soon see if others follow suit. And if the state is listening.




Bias Confirmed

Last year, I wrote about a study of Tennessee TVAAS scores conducted by Jessica Holloway-Libell. She examined 10 Tennessee school districts and their TVAAS score distribution. Her findings suggest that ELA teachers are less likely than Math teachers to receive positive TVAAS scores, and that middle school teachers generally, and middle school ELA teachers in particular, are more likely to receive lower TVAAS scores.

The findings, based on a sampling of districts, suggest one of two things:

1) Tennessee’s ELA teachers are NOT as effective as Tennessee’s Math teachers and the middle school teachers are less effective than the high school teachers.

2) TVAAS scores are biased against ELA teachers (or in favor of Math teachers) due to the nature of the subjects being tested.

The second option actually has support from data analysis, as I indicated at the time and repeat here:

Holloway-Libell’s findings are consistent with those of Lockwood and McCaffrey (2007) published in the Journal of Educational Measurement:

The researchers tested various VAM models and found that teacher effect estimates changed significantly based on both what was being measured AND how it was measured.

That is, it’s totally consistent with VAM to have different estimates for math and ELA teachers, for example. Math questions are often asked in a different manner than ELA questions and the assessment is covering different subject matter.

Now, there’s even more evidence to suggest that TVAAS scores vary based on subject matter and grade level – which would minimize their ability to provide meaningful information about teacher effectiveness.

A recently released study about effective teaching in Tennessee includes the following information:

The study used TVAAS scores alone to determine a student’s access to “effective teaching.” A teacher receiving a TVAAS score of 4 or 5 was determined to be “highly effective” for the purposes of the study. The findings indicate that Math teachers are more likely to be rated effective by TVAAS than ELA teachers and that ELA teachers in grades 4-8 (mostly middle school grades) were the least likely to be rated effective. These findings offer support for the similar findings made by Holloway-Libell in a sample of districts. They are particularly noteworthy because they are more comprehensive, including most districts in the state.

Here’s a breakdown of the findings by percentage of teachers rated effective and including the number of districts used to determine the average.

Grades/Subject    Teachers rated effective    Districts
4-8 Math          47.5%                       126
HS Math           38.9%                        94
4-8 ELA           24.2%                       131
HS ELA            31.1%                       100

So, TVAAS scores are more likely to result in math teachers being rated effective and middle school ELA teachers are the least likely to receive effective ratings.
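
The size of those gaps is easy to compute directly. The short Python sketch below only re-enters the four percentages from the breakdown above; the dictionary and function names are made up for illustration. Because each figure is an average over a different number of districts, treat the subtraction as a rough comparison and not a formal statistical test.

```python
# Percent of teachers rated effective, copied from the breakdown above.
effective_rate = {
    ("4-8", "Math"): 47.5,
    ("HS", "Math"): 38.9,
    ("4-8", "ELA"): 24.2,
    ("HS", "ELA"): 31.1,
}


def math_ela_gap(grade_band):
    """Math minus ELA effective rate, in percentage points, for one grade band."""
    gap = effective_rate[(grade_band, "Math")] - effective_rate[(grade_band, "ELA")]
    return round(gap, 1)


if __name__ == "__main__":
    for band in ("4-8", "HS"):
        print(f"{band}: Math leads ELA by {math_ela_gap(band):.1f} percentage points")
```

The gap is 23.3 points in grades 4-8 and 7.8 points in high school, which is why the grade 4-8 ELA figure stands out.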

Again, the question is: Are Tennessee’s ELA teachers really worse than our Math teachers? And, are middle school ELA teachers the worst teachers in Tennessee?

Alternatively, one might suppose that TVAAS, as data from other value-added models suggests, is susceptible to subject matter bias, and to a lesser extent, grade level bias.

That is, the data generated by TVAAS is not a reliable predictor of teacher performance.



Not Yet Ready for Teacher Evaluation?

Last night, the Knox County Board of Education passed a resolution asking the state to not count this year’s new TNReady test in teacher evaluation.

Board members cited the grace period the state is granting to students as one reason for the request. While standardized test scores count in student grades, the state has granted a waiver of that requirement in the first year of the new test.

However, no such waiver was granted for teachers, who are evaluated using student test scores and a metric known as value-added modeling that purports to reflect student growth.

Instead, the Department of Education proposed and the legislature supported a plan to phase in the TNReady scores in teacher evaluations. This plan presents problems in terms of statistical validity.

Additionally, the American Educational Research Association released a statement recently cautioning states against using value-added models in high-stakes decisions involving teachers:

In a statement released today, the American Educational Research Association (AERA) advises those using or considering use of value-added models (VAM) about the scientific and technical limitations of these measures for evaluating educators and programs that prepare teachers. The statement, approved by AERA Council, cautions against the use of VAM for high-stakes decisions regarding educators.

So, regardless of the phase-in of TNReady, value-added models for evaluating teachers are problematic. When you add the transition to a new test to the mix, you only compound the existing problems, making any “score” assigned to a teacher even more unreliable.

Tullahoma City Schools Superintendent Dan Lawson recently addressed the challenges with TVAAS in a letter, noting:

Our teachers are tasked with a tremendous responsibility and our principals who provide direct supervision assign teachers to areas where they are most needed. The excessive reliance on production of a “teacher number” produces stress, a lack of confidence and a drive to first protect oneself rather than best educate the child.

It will be interesting to see if other school systems follow Knox County’s lead on this front. Even more interesting: Will the legislature take action and, at the least, waive the TNReady scores from teacher evaluations in the first year of the new test?

A more serious, long-term concern is the use of value-added modeling in teacher evaluation and, especially, in high-stakes decisions like the granting of tenure, pay, and hiring/firing.

More on Value-Added Modeling

The Absurdity of VAM

Unreliable and Invalid

Some Inconvenient Facts About VAM
