Apples and Oranges

Here’s what Director of Schools Dorsey Hopson had to say amid reports that schools in his Shelby County district showed low growth according to recently released state test data:

Hopson acknowledged concerns over how the state compares results from “two very different tests which clearly are apples and oranges,” but he added that the district won’t use that as an excuse.

“Notwithstanding those questions, it’s the system upon which we’re evaluated on and judged,” he said.

State officials stand by TVAAS. They say drops in proficiency rates resulting from a harder test have no impact on the ability of teachers, schools and districts to earn strong TVAAS scores, since all students are experiencing the same change.

That’s all well and good, except that when the system upon which you are evaluated is seriously flawed, there’s an obligation to speak out and fight back.

Two years ago, ahead of what should have been the first year of TNReady, I wrote about the challenges of creating valid TVAAS scores while transitioning to a new test. TNReady was not just a different test; it was (and is) a different type of test from the previous TCAP. For example, it included constructed-response questions instead of simply multiple-choice, bubble-in questions.

Here’s what I wrote:

Here’s the problem: There is no statistically valid way to predict expected growth on a new test based on the historic results of TCAP. First, the new test has (supposedly) not been fully designed. Second, the test is in a different format. It’s computer-based, and it contains constructed-response questions. That is, students must write out answers and/or demonstrate their work.

Since Tennessee has never had a test like this, it’s impossible to predict growth at all. Not even with 10% confidence. Not with any confidence. It is the textbook definition of comparing apples to oranges.

Here’s a statement from the academic article I cited to support this claim:

Here’s what Lockwood and McCaffrey (2007) had to say in the Journal of Educational Measurement:

We find that the variation in estimated effects resulting from the different mathematics achievement measures is large relative to variation resulting from choices about model specification, and that the variation within teachers across achievement measures is larger than the variation across teachers.
You get different value-added results depending on the type of test you use. That is, you can’t just say, “this is a new test, but we’ll compare peer groups from the old test and see what happens.” Plus, TNReady presents the added challenge of not having been fully administered last year, so you’re now looking at data from two years ago and extrapolating to this year’s results.
Of course, the company being paid millions to crunch the TVAAS numbers says this transition presents no problem at all. Here’s what its technical document has to say about the matter:
In 2015-16, Tennessee implemented new End-of-Course (EOC) assessments in math and English/language arts. Redesigned assessments in Math and English/language arts were also implemented in grades 3-8 during the 2016-17 school year. Changes in testing regimes occur at regular intervals within any state, and these changes need not disrupt the continuity and use of value-added reporting by educators and policymakers. Based on twenty years of experience with providing value-added and growth reporting to Tennessee educators, EVAAS has developed several ways to accommodate changes in testing regimes.
Prior to any value-added analyses with new tests, EVAAS verifies that the test’s scaling properties are suitable for such reporting. In addition to the criteria listed above, EVAAS verifies that the new test is related to the old test to ensure that the comparison from one year to the next is statistically reliable. Perfect correlation is not required, but there should be a strong relationship between the new test and old test. For example, a new Algebra I exam should be correlated to previous math scores in grades seven and eight and to a lesser extent other grades and subjects such as English/language arts and science. Once suitability of any new assessment has been confirmed, it is possible to use both the historical testing data and the new testing data to avoid any breaks or delays in value-added reporting.
A couple of problems with this. First, there was NO complete administration of a new testing regime in 2015-16. It didn’t happen.
Second, EVAAS doesn’t get paid if there’s no way to generate these “growth scores,” so it is in the company’s interest to find some justification for comparing the two very different tests.
Third, researchers who study value-added modeling are highly skeptical of the reliability of comparisons between different types of tests when it comes to generating value-added scores. I noted Lockwood and McCaffrey (2007) above. Here are some more:
John Papay (2011) did a similar study using three different reading tests, with similar results. He stated his conclusion as follows:

[T]he correlations between teacher value-added estimates derived from three separate reading tests — the state test, SRI [Scholastic Reading Inventory], and SAT [Stanford Achievement Test] — range from 0.15 to 0.58 across a wide range of model specifications. Although these correlations are moderately high, these assessments produce substantially different answers about individual teacher performance and do not rank individual teachers consistently. Even using the same test but varying the timing of the baseline and outcome measure introduces a great deal of instability to teacher rankings.
Two points worth noting here: First, different tests yield different value-added scores. Second, even using the same test but varying the timing can create instability in growth measures.
Then, there’s data from the Measures of Effective Teaching (MET) Project, which included data from Memphis. On the reliability of value-added estimates across different types of tests, here’s what MET reported:
Once more, the MET study offered corroborating evidence. The correlation between value-added scores based on two different mathematics tests given to the same students the same year was only .38. For 2 different reading tests, the correlation was .22 (the MET Project, 2010, pp. 23, 25).
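To get a feel for what correlations in that range mean, here is a minimal sketch in Python (simulated teachers and made-up noise levels, not the MET or Papay data): each test’s value-added estimate is treated as a teacher’s underlying effect plus test-specific noise, and we check how well the two sets of estimates agree.

import numpy as np

rng = np.random.default_rng(0)
n_teachers = 500

true_effect = rng.normal(0, 1, n_teachers)                  # each teacher's underlying impact
est_test_a = true_effect + rng.normal(0, 1.0, n_teachers)   # estimate from test A (e.g., the state test)
est_test_b = true_effect + rng.normal(0, 1.5, n_teachers)   # estimate from test B (different format/content)

r = np.corrcoef(est_test_a, est_test_b)[0, 1]
print(f"correlation between the two sets of estimates: {r:.2f}")

# How many of test A's "top 20%" teachers are also "top 20%" on test B?
top_a = est_test_a >= np.quantile(est_test_a, 0.8)
top_b = est_test_b >= np.quantile(est_test_b, 0.8)
print(f"overlap among 'top' teachers: {(top_a & top_b).sum() / top_a.sum():.0%}")

With noise of this size, the correlation lands in the same general range those studies report, and a sizable share of the teachers rated “top” on one test fall out of the “top” group on the other, which is exactly why swapping tests mid-stream scrambles the rankings.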
Despite the claims of EVAAS, the academic research raises significant concerns about extrapolating results from different types of tests. In short, when you move to a different test, you get different value-added results. As I noted in 2015:

If you measure different skills, you get different results. That decreases (or eliminates) the reliability of those results. TNReady is measuring different skills in a different format than TCAP. It’s BOTH a different type of test AND a test on different standards. Any value-added comparison between the two tests is statistically suspect, at best. In the first year, such a comparison is invalid and unreliable. As more years of data become available, it may be possible to make some correlation between past TCAP results and TNReady scores.

Or, if the state is determined to use growth scores (and wants to use them with accuracy), they will wait several years and build completely new growth models based on TNReady alone. At least three years of data would be needed in order to build such a model.
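To make that concrete, here is a bare-bones sketch in Python of the kind of growth model the state could eventually build from TNReady data alone. It is far simpler than the actual TVAAS/EVAAS machinery and uses simulated scores, but it shows the basic requirement: you need prior-year and current-year results from the same testing regime before you can estimate expected growth at all.

import numpy as np

rng = np.random.default_rng(1)

# Simulated data: two consecutive years of TNReady scale scores for students
# grouped into schools (all numbers are made up for illustration).
n_students, n_schools = 5000, 50
school = rng.integers(0, n_schools, n_students)
prior = rng.normal(350, 25, n_students)                      # last year's TNReady score
school_effect = rng.normal(0, 3, n_schools)                  # underlying school-level growth
current = 40 + 0.9 * prior + school_effect[school] + rng.normal(0, 15, n_students)

# Step 1: estimate the statewide expectation for this year's score given last year's.
slope, intercept = np.polyfit(prior, current, 1)
expected = intercept + slope * prior

# Step 2: a school's growth estimate is how far its students land, on average,
# above or below that expectation.
growth = [(current - expected)[school == s].mean() for s in range(n_schools)]
print("sample school growth estimates:", np.round(growth[:5], 1))

Even this toy version cannot be fit until two consecutive TNReady administrations exist, and its predictions only become stable once several years of the new test accumulate, which is the point of waiting before hanging evaluations on the numbers.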

Dorsey Hopson and other Directors of Schools should be pushing back aggressively. Educators should be outraged. After all, this unreliable data will be used as a portion of their teacher evaluations this year. Schools are being rated on a 1-5 scale based on a growth model grounded in suspect methods.

How much is this apple like last year’s orange? How much will this apple ever be like last year’s orange?

If we’re determined to use value-added modeling to measure school-wide growth or district performance, we should at least be determined to do it in a way that ensures valid, reliable results.

For more on education politics and policy in Tennessee, follow @TNEdReport


 

It Doesn’t Matter Except When It Does

This year’s TNReady quick score setback means some districts will use the results in student report cards and some won’t. Of course, that’s nobody’s fault. 

One interesting point in all of this came when Commissioner McQueen observed that quick scores aren’t what really matters anyway. Chalkbeat reports:

The commissioner emphasized that the data that matters most is not the preliminary data but the final score reports, which are scheduled for release in July for high schools and the fall for grades 3-8. Those scores are factored into teachers’ evaluations and are also used to measure the effectiveness of schools and districts.

“Not until you get the score report will you have the full context of a student’s performance level and strengths and weaknesses in relation to the standards,” she said.

The early data matters to districts, though, because Tennessee has tied the scores to student grades since 2011.

First, tying the quick scores to student grades is problematic. Assuming TNReady is a good, reliable test, we’d want the best results to be used in any grade calculation. The shift to pencil and paper this year makes that impossible. Even when we switch to a test fully administered online, it may not be possible to get the full scores back in time to use them in student grades.

Shifting to a model that uses TNReady to inform and diagnose rather than evaluate students and teachers could help address this issue. Shifting further to a project-based assessment model could actually help students while also serving as a more accurate indicator of whether they have met the standards.

Next, the story notes that teachers will be evaluated based on the scores. This will be done via TVAAS — the state’s value-added modeling system. Even as more states move away from value-added models in teacher evaluation, Tennessee continues to insist on using this flawed model.

Again, let’s assume TNReady is an amazing test that truly measures student mastery of standards. It’s still NOT designed for the purpose of evaluating teacher performance. Further, this is the first year the test has been administered. That means it’s simply not possible to generate valid data on teacher performance from this year’s results. You can’t just take this year’s test (TNReady) and compare it to the TCAP from two years ago. They are different tests designed to measure different standards in a different way. You know, the old apples and oranges thing.

One teacher had this to say about the situation:

“There’s so much time and stress on students, and here again it’s not ready,” said Tikeila Rucker, a Memphis teacher who is president of the United Education Association of Shelby County.

For more on education politics and policy in Tennessee, follow @TNEdReport


 

Mike Stein on the Teachers’ Bill of Rights

Coffee County teacher Mike Stein offers his thoughts on the Teachers’ Bill of Rights (SB14/HB1074), sponsored in the General Assembly by Senator Mark Green of Clarksville and Representative Jay Reedy of Erin.

Here’s some of what he has to say:

In my view, the most impactful elements of the Teachers’ Bill of Rights are the last four items. Teachers have been saying for decades that we shouldn’t be expected to purchase our own school supplies. No other profession does that. Additionally, it makes much-needed changes to the evaluation system. It is difficult, if not impossible, to argue against the notion that we should be evaluated by other educators with the same expertise. While good teaching is good teaching, there are content-specific strategies that only experts in that subject would truly be able to appreciate fully. Both the Coffee County Education Association and the Tennessee Education Association support this bill.

And here are those four items he references:

This bill further provides that an educator is not: (1) Required to spend the educator’s personal money to appropriately equip a classroom; (2) Evaluated by professionals, under the teacher evaluation advisory committee, without the same subject matter expertise as the educator; (3) Evaluated based on the performance of students whom the educator has never taught; or (4) Relocated to a different school based solely on test scores from state mandated assessments.

The legislation would change the teacher evaluation system by effectively eliminating TVAAS scores from the evaluations of teachers in non-tested subjects — those scores may be replaced by portfolios, an idea the state has rolled out but not funded. Additionally, identifying subject-matter-specific evaluators could prove difficult, but it would likely produce stronger, more relevant evaluations.

Currently, teachers aren’t required to spend their own money on classrooms, but many teachers do because schools too often lack the resources to meet the needs of students. It’s good to see Senator Green and Rep. Reedy drawing attention to the important issue of classroom resources.

For more on education politics and policy in Tennessee, follow @TNEdReport


 

Reform is Working

That’s the message from the Tennessee Department of Education based on recently released TCAP results and an analysis of the data over time.

You can see for yourself here and here.

The one area of concern is reading, but overall, students are performing better than they were when the new TCAP tests were introduced and standards were raised.

Here’s the interesting thing: This is true across school districts and demographic subgroups. The trend is positive.

Here’s something else: A similar trend could be seen in results before the change in the test in 2009.

Tennessee students were steadily making gains. Teachers and schools were hitting the mark set for them by policymakers. This was in an era of collective bargaining for teachers and no TVAAS-based evaluation or pay schemes.

When the standards were raised — certainly a welcome change — teachers again hit the mark.

Of course, since the standards changed, lots of other reforms have taken place. Most of these have centered on teachers and the incorporation of TVAAS into teacher evaluation and even pay schemes. The State Board of Education even gutted the old state salary schedule to promote pay differentiation, ostensibly based on TVAAS scores.

But does pay for TVAAS actually lead to improved student outcomes as measured by TVAAS?

Consider this comparison of Putnam County and Cumberland County. Putnam was one of the original Teacher Incentive Fund (TIF) recipients and among the first to develop a pay scheme based on teacher evaluations and TVAAS.

Putnam’s 2014 TVAAS results are positive, to be sure. But neighboring Cumberland County (a district that is demographically similar and has a similar assortment of schools) also shows positive TVAAS results.  Cumberland relies on the traditional teacher pay scale. From 2012-13 to 2013-14, Putnam saw a 50% increase in the number of categories (all schools included) in which they earned TVAAS scores of 5. So did Cumberland County.

Likewise, from 2012-13 to 2013-14, Putnam saw a 13% decline in the number of categories in which they earned TVAAS scores below a 3. In Cumberland County, the number was cut by 11%.
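Those percentages are just year-over-year changes in counts of TVAAS categories. As a quick illustration of the arithmetic in Python (the counts below are hypothetical, not the districts’ actual figures):

# Hypothetical counts, used only to show how the percent change is computed.
level5_2013 = 20    # categories earning a TVAAS score of 5 in 2012-13 (made-up)
level5_2014 = 30    # categories earning a TVAAS score of 5 in 2013-14 (made-up)

pct_change = (level5_2014 - level5_2013) / level5_2013 * 100
print(f"change in level-5 categories: {pct_change:+.0f}%")   # +50%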

This is one example over a two-year cycle. New district-level results for 2015 will soon be available and will warrant an update. But it’s also worth noting that these results track the results seen in Denver in analyses of its ProComp pay system. Specifically, the University of Colorado’s Denver ProComp Evaluation Report (2010-2012) finds little impact of ProComp on student achievement or on teachers’ professional practices, including their teaching practices and retention.

This initial Putnam-Cumberland analysis tracks with the ProComp findings: Teacher performance pay, even if devised in conjunction with teacher groups, cannot be said to have a significant impact on student performance over time.

So, prior to 2008, student academic achievement as measured by Tennessee standardized tests showed steady improvement over time. This occurred in an environment with no performance pay. Again, from 2009 to 2015, across districts and demographic groups, student achievement improved. Only a small number of Tennessee districts have performance pay schemes — so that alone would indicate that performance pay is not driving improved student outcomes. And a preliminary comparison of two districts suggests that both performance pay and non-performance pay districts see significant (and similar) TVAAS gains.

Reform may be working — but it may not be the reform the reformers want to push.

For more on education politics and policy in Tennessee, follow @TNEdReport

The Value of the Report Card on Teacher Training

Every year, the Tennessee Higher Education Commission issues a Report Card on the state’s teacher training programs. To evaluate educator effectiveness, THEC uses the Tennessee Value-Added Assessment System (TVAAS).

Which effectively renders the Report Card of little value.

Not included in the report is a teacher’s overall effectiveness score under the TEAM model, which combines observation scores, value-added data, and other achievement measures. That would be a more robust number to report, but it’s missing.
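For context, the overall TEAM score blends those components into a single 1-5 rating. The sketch below (Python, with made-up component scores; the 50/35/15 weighting is the commonly cited split for teachers with individual growth scores and is treated here as an assumption) shows how such a composite would be assembled, and why it is a broader measure than the TVAAS piece alone.

def team_composite(observation, growth, achievement, weights=(0.50, 0.35, 0.15)):
    """Weighted blend of the three TEAM components, each on a 1-5 scale.
    The 50/35/15 split is an assumed, illustrative weighting."""
    w_obs, w_growth, w_ach = weights
    return w_obs * observation + w_growth * growth + w_ach * achievement

# Made-up example: strong observations, middling TVAAS, solid achievement measure.
print(team_composite(observation=4.5, growth=3.0, achievement=4.0))   # 3.9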

I’ve written before on the very limited value of value-added data.

Here are some highlights of why we learn almost nothing from the THEC report in terms of whether or not a teacher education program is actually doing a good job:

Here’s the finding that gets all the attention: A top 5 percent teacher (according to value-added modeling, or VAM) can help a classroom of 28 students earn $250,000 more collectively over their lifetimes.

Now, a quarter of a million sounds like a lot of money.

But, in their sample, a classroom was 28 students. So that equates to $8,928.57 per child over their lifetime. That’s right: NOT $8,928.57 more per year, but $8,928.57 more over their whole life.

For more math fun, that’s about $297.62 more per year over a thirty-year career with a VAM-designated “great” teacher versus an average one.
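Here is the arithmetic behind those figures, laid out as a short sketch in Python (the $250,000 classroom figure and the 28-student class come from the study; spreading the gain over a thirty-year working life is the assumption used above):

classroom_gain = 250_000     # extra lifetime earnings for the whole class, per the study
class_size = 28              # students per classroom in the study's sample
working_years = 30           # assumed span over which the lifetime gain is spread

per_student_lifetime = classroom_gain / class_size
per_student_per_year = per_student_lifetime / working_years

print(f"per student, over a lifetime: ${per_student_lifetime:,.2f}")   # $8,928.57
print(f"per student, per year:        ${per_student_per_year:,.2f}")   # $297.62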

Yep, get your kid into a high value-added teacher’s classroom and they could be living in style, making a whole $300 more per year than their friends who had the misfortune of being in an average teacher’s room.

If we go all the way down to what VAM designates as “ineffective” teaching, you’d likely see that number double, or maybe go a little higher. So, let’s say it doubles plus some. Now, your kid has a low VAM teacher and the neighbor’s kid has a high VAM teacher. What’s that do to his or her life?

Well, it looks like this: The neighbor kid gets a starting job offer of $41,000 and your kid gets a starting offer of $40,000.

So, THEC uses a marginal indicator of educator effectiveness to make a significant determination about whether or not educator training programs are effective. At the very least, such a determination should also include observed scores of these teachers over time or the entire TEAM score.

Until then, the annual Report Card on teacher training will add little value to the education policy discussion in Tennessee.

For more on education politics and policy in Tennessee, follow @TNEdReport