GothamSchools — daily independent reporting on NYC public schools

Eye on Education

Rigor Mortis And Measurement Error In New Evaluations

The word rigor comes up a lot in teacher-evaluation systems. It’s akin to motherhood, apple pie and the American flag. What policymaker is going to take a stand against rigor? But the term is getting distorted almost beyond recognition.

In science, a rigorous study is one in which the scientific claims are supported by the evidence. Scientific rigor is primarily determined by the study’s design and data-analysis methods. It has nothing to do with the substance of the scientific claims. A study that concludes that an educational program or intervention is ineffective, for example, is not inherently more rigorous than one that concludes that a program works.

In the current discourse on teacher-evaluation systems, however, an evaluation system is deemed rigorous based either on how much of the evaluation rests on direct measures of student-learning outcomes, or the distribution of teachers into the various rating categories, or both. If an evaluation system relies heavily on No Child Left Behind-style state standardized tests in reading and mathematics — say, 40 percent of the overall evaluation or more — its proponents are likely to describe it as rigorous. Similarly, if an evaluation system has four performance categories — e.g., ineffective, developing, effective and highly effective — a system that classifies very few teachers as highly effective and many teachers as ineffective may be labeled rigorous.

In these instances, the word rigor obscures the subjectivity involved in the final composite rating assigned to teachers. The fraction of the overall evaluation based on student-learning outcomes is wholly a matter of judgment; and if you believe, as I do, that a teacher’s responsibility for advancing student learning extends well beyond the content that appears on standardized tests, you could conceivably argue that increasing the weight given to standardized tests in teacher evaluations makes these evaluations less rigorous. This is, however, a hard sell in the absence of other concrete measures of student-learning outcomes that could supplement the standardized-test results.

Even more importantly, describing a teacher-evaluation system as rigorous hides the fact that the criteria for assigning teachers to performance categories — either for subcomponents or for the overall composite evaluation — are arbitrary. There’s no scientific basis for saying, as New York has, that of the 20 points out of 100 allocated for student “growth” on New York’s state tests, a teacher needs to receive 18 to be rated “highly effective,” or that a teacher receiving 3 to 8 points will be classified as “developing.” In fact, the cut-off separating “developing” from “effective” changed last week as a result of an agreement reached between the New York State Education Department and the state teachers’ union — not because of science, mind you, but because of politics.

And it’s politics, and politics alone, that accounts for the fact that the rules for the overall composite evaluation say that any teacher who scores 0 to 64 points will be classified as ineffective, and that the two subcomponents for student “growth” and local assessments, each of which counts for 20 points, classify teachers who score 0 to 2 points on each component as ineffective. This means, as Long Island principal Carol Burris and others have pointed out, that if a teacher is classified as ineffective on both of these subcomponents, that teacher is automatically rated ineffective overall, even if that teacher is rated highly effective on the 60 points allocated for measures of a teacher’s professional practices. It certainly seems odd that two components accounting for 40 percent of a teacher’s overall rating can trump the remaining 60 percent — but this isn’t science, it’s politics.
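The arithmetic behind this point is easy to check. The sketch below uses only the point ranges quoted above (20 points each for growth and local assessments, 60 for professional practice, and an overall ineffective band of 0 to 64). The constant and function names are hypothetical, chosen for illustration, and the sketch makes no claim about how the state itself computes the rating.

```python
# Point ranges as quoted in the text; names are illustrative only.
INEFFECTIVE_MAX_COMPOSITE = 64     # overall scores of 0-64 are rated ineffective
INEFFECTIVE_MAX_SUBCOMPONENT = 2   # 0-2 points on a 20-point subcomponent is ineffective
PRACTICE_MAX = 60                  # maximum points for professional practice


def composite_score(growth, local, practice):
    """Sum the three components: growth (0-20), local (0-20), practice (0-60)."""
    return growth + local + practice


# Best case for a teacher rated ineffective on both test-based subcomponents:
# the top of the ineffective band on each, plus a perfect practice score.
best = composite_score(
    INEFFECTIVE_MAX_SUBCOMPONENT, INEFFECTIVE_MAX_SUBCOMPONENT, PRACTICE_MAX
)
print(best, best <= INEFFECTIVE_MAX_COMPOSITE)  # prints: 64 True
```

Even with every available point for professional practice, such a teacher tops out at 64, which is still inside the ineffective band.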

Other states face the same challenge in assigning teachers’ value-added scores or student growth percentile scores to performance categories, and most of them have punted, issuing regulations that defer these difficult decisions until later. Illinois says that it’s “working diligently” on this. Georgia claims that its model will be identified soon. Michigan is counting on a rating system to be developed by the Governor’s Council on Educator Effectiveness. After a year of debate, Delaware concluded that it couldn’t figure out how to use students’ scores on the state assessment system in teachers’ summative ratings for the 2011-2012 school year, and deferred implementation until the future.

It violates a basic principle of fairness for teachers to be held accountable for performance criteria that aren’t clearly specified in advance and that may be unattainable. These states, and many others, have their work cut out for them.

Nowhere is this more evident than with the mapping of teachers’ value-added or student growth percentile scores onto the ratings composing a teacher’s summative evaluation. The value-added or student growth percentile scores are measured with errors that can be substantial, especially when they are based on a single year’s worth of student achievement data. But the scoring bands for ratings categories such as “developing” or “effective” have strict cut-offs. What to do?

One way of reclaiming the concept of rigor in teacher-evaluation systems is to assign ratings that take into account the uncertainty or errors in the measures. This is consistent with a scientific conception of rigor: the assignment of teachers to rating categories should be consistent with the quality of the evidence for doing so. A teacher shouldn’t be assigned a rating of “ineffective” based on a value-added score, for example, if there’s a substantial probability that the teacher’s true rating is “developing.”

So here’s a challenge, and a proposal. The challenge is to state education policymakers across the country who have hitched their teacher-evaluation systems to measures that seek to isolate teachers’ contributions to their students’ learning: Develop clear and consistent guidelines for assigning teachers to rating categories that take into account the inherent uncertainty and errors in the value-added measures and their variants.

And here’s the proposal: A teacher should be assigned to the lower of two adjacent rating categories only if there is at least 90 percent confidence that the teacher is not in the higher category. Operationally, this involves a statistical test based on a cut score, a teacher’s score and the error associated with that score.

Suppose, for example, that the cut-off separating “ineffective” and “developing” is a teacher being in the 10th percentile across the state on a value-added or student growth percentile measure. Teacher A’s percentile rating is the eighth percentile, but the standard error for her rating is two percentile points. Given the uncertainty in the rating, there is a 16 percent probability that Teacher A’s true percentile rating is greater than the 10th percentile, and an 84 percent probability that her true percentile is lower than the 10th percentile. Because 84 percent falls short of the 90 percent confidence my proposal requires, Teacher A should be classified as developing, not ineffective.

Conversely, Teacher B’s percentile rating is the fourth percentile, and the standard error for her rating is three percentile points. Given the uncertainty in the rating, there is only a 2 percent probability that Teacher B’s true percentile value is above 10, and a 98 percent probability that her true percentile rating is lower than the 10th percentile. Since 98 percent exceeds the 90 percent standard, Teacher B would be classified as ineffective.
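The test behind these two examples can be sketched in a few lines of Python. This is a minimal illustration under one stated assumption: that the error in a teacher’s percentile score is approximately normally distributed, which is what reproduces the 16 percent and 2 percent figures above. The function names are hypothetical and are not drawn from any state’s regulations.

```python
from statistics import NormalDist


def prob_true_score_above(cut, score, se):
    """Probability that the true score exceeds the cut score.

    Assumes (for illustration) normally distributed measurement error
    centered on the observed score with standard deviation `se`.
    """
    return 1.0 - NormalDist(mu=score, sigma=se).cdf(cut)


def classify(cut, score, se, lower, higher, confidence=0.90):
    """Assign the lower category only if we are at least `confidence` sure
    that the true score is below the cut; otherwise assign the higher one."""
    p_below = 1.0 - prob_true_score_above(cut, score, se)
    return lower if p_below >= confidence else higher


CUT = 10  # 10th percentile separates "ineffective" from "developing"

# Teacher A: 8th percentile, standard error 2
# Teacher B: 4th percentile, standard error 3
for name, score, se in [("A", 8, 2), ("B", 4, 3)]:
    p_above = prob_true_score_above(CUT, score, se)
    rating = classify(CUT, score, se, "ineffective", "developing")
    print(name, round(p_above, 3), rating)
# prints:
# A 0.159 developing
# B 0.023 ineffective
```

The `confidence` parameter makes the arbitrariness of the threshold explicit: changing it to 0.80, for example, would move Teacher A into the ineffective category.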

Other approaches are certainly viable; the 90 percent confidence level is arbitrary, but it seems sensible to me. In most educational, social and medical research, a common standard is to trust an observed effect only if an effect that large would arise by chance less than 5 percent of the time if there were no true effect in the population. The 90 percent standard I’m proposing is slightly more lenient. And of course this approach doesn’t address the arbitrariness in the New York scheme described above.

If policymakers aren’t willing to take measurement error into account in a defensible way in teacher-evaluation systems, don’t talk to me about rigor — rigor is dead.

This post also appears on Eye on Education, Aaron Pallas’s Hechinger Report blog.

  • Fred Smith

    Thank you for this contribution, Aaron.
     
    As you pointed out, various schemes have been/are being devised using test scores to evaluate teachers. On the surface the numbers provide a false sense that objectivity, quantitative reasoning and scientific precision are guiding the rating process.  Peek behind the curtain and find the elaborate grading systems rest on arbitrary decisions born of numerology and political calculation.

    But the politicians are clever.  Just as they have misused flimsy data to serve their purposes, so too have they corrupted language and seized upon slogans to mislead. 
    The words “accountability,” “transparency,” “standards” and “value-added” come to mind. And who can argue with No Child Left Behind, Children First and Race to the Top? As you point out, add “rigor” to the list. Did someone say Orwell?

    Let me just extend how “rigor” has been hijacked.  Since the impossibly high 2009 New York State test results were released, the Board of Regents has conspicuously directed its State Education Department to administer more rigorous tests.

    Sure enough, in 2010 and 2011 the statewide and citywide percentages of students deemed to be proficient in reading and math fell precipitously.  These reversals were attributed to tougher exams when, in fact, the decreases reflected nothing more than an increase in the cut-off scores needed to reach the threshold of proficiency.  The same decision could have been made in 2009 (an election year).

    A closer examination of the data reveals that the items on the test hadn’t gotten any tougher.  Raising the cut off scores was sold to the media as a return to higher standards–but it’s hard to reconcile this claim in light of the easier items.

    The scariest part is that the test instruments, their results and the way they are being weighted (without regard to measurement error) are the foundation upon which students, teachers and schools are being judged.  Given how high the stakes have become, we cannot afford to have testing systems that are so compromised.

    Fred Smith

  • Ken Hirsh

    Interesting stuff.  Aaron, what is the null hypothesis in your proposal with respect to teacher quality?  It seems to me that your proposal is consistent with a primary concern of unfairly judging teachers, rather than unfairly subjecting students to an ineffective teacher.        

    (For what it’s worth, I favor a subjective system with appropriately incented and accountable school leaders.)

  • http://www.facebook.com/aaronpallas Aaron Pallas

     Ken,

    I’m not sure what the notion of a null hypothesis with respect to teacher quality means.  Do I think that teachers vary in their quality?  Sure, as I believe that doctors, lawyers, and other practitioners do.  But I am not persuaded that the current generation of teacher evaluation systems is able to discern meaningful differences.

    The APPR process is a mechanism for appraising the performance of teachers by classifying them into one of four categories, based on a set of criteria and weights that are the result of a political process.  I’m trying to promote transparency and fairness in the application of those criteria and weights to the classification process.  As I think you know, I’d choose different criteria and weights. 

    The notion of “unfairly subjecting students to an ineffective teacher” runs the risk of circularity without a concrete specification of the criteria that define an ineffective teacher.    

  • Vote NO!

     Ken,

    The irony of the law drafted last week is that it will likely result in far more ineffective teachers in NY classrooms than currently exist. It has severely undermined teaching in NY state. It is going to be a “tough sell” to convince college students to expend the time and finances for a 4-year college degree to enter teaching, now that a teacher’s continued employment will be determined by many more factors beyond their control than by factors they do control.

  • Guest

    It’s cute that you think that the powers that be care about anything other than lining their friends’ pockets.  This isn’t about kids, it’s about money.

  • MG

    What is the typical measurement error in value-added evaluations of teachers?  

    Your examples use SEs of 2 points and 3 points.  But I would have guessed SEs are much higher.  Are they?  

  • http://www.facebook.com/aaronpallas Aaron Pallas

     Yes, they typically will be larger.  Sean Corcoran’s Annenberg report analyzing the 2008-09 NYC Teacher Data Reports indicates that for one year of data, the average SE is about 16 points for ELA and 15 points for math;  for three years of data, the average SE is about 9 points for ELA and 7 points for math.  (These standard errors may be smaller for point estimates which are very high or very low.)  New York’s proposal is to use a single year of data to calculate student growth percentiles, and consider a contractor recommendation for a value-added model in the future.  Needless to say, there’s a lot of uncertainty in single-year estimates. 

  • Ken Hirsh

    Good stuff.  What do you have in mind with respect to “a concrete specification of the criteria that define an ineffective teacher”?
