GothamSchools — daily independent reporting on NYC public schools

Wide margins of error, instability on city’s value-added reports

Some English Language Arts teachers received high "value-added" scores in 2007 but much lower scores in 2008.

The value-added reports meant to measure city teachers’ effectiveness have wide margins of error and give judgments that fluctuate — sometimes wildly — from one year to the next, a new analysis finds.

Schools Chancellor Joel Klein has instructed principals to use the Teacher Data Reports as one way to decide which teachers should receive tenure. Teachers who teach English or math to students in grades three through eight receive the reports.

The NYU economist Sean Corcoran found that 31 percent of English teachers who ranked in the bottom quintile of teachers in 2007 had jumped to one of the top two quintiles by 2008. About 23 percent of math teachers made the same jump.

There was an overall correlation between how a teacher scored from one year to the next, and for some teachers, the measurement was more stable. Of the math teachers who ranked in the top quintile in 2007, 40 percent retained that crown in 2008.
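To see how measurement noise alone can produce swings like these, here is a minimal simulation sketch. It is not Corcoran's model or the city's method: it assumes each teacher has a fixed "true" effect and that each year's score adds independent random noise. The function names, the number of teachers, and the noise level (`noise_sd`) are illustrative assumptions, not values from the report.

```python
import random

def assign_quintiles(scores):
    """Rank teachers against each other and return each one's quintile (1 = bottom, 5 = top)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    quintiles = [0] * n
    for rank, i in enumerate(order):
        quintiles[i] = rank * 5 // n + 1
    return quintiles

def simulate(n_teachers=5000, noise_sd=1.5, seed=0):
    """Hypothetical model: yearly score = fixed true effect + independent noise."""
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    year1 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    year2 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    q1, q2 = assign_quintiles(year1), assign_quintiles(year2)

    bottom = [i for i in range(n_teachers) if q1[i] == 1]
    top = [i for i in range(n_teachers) if q1[i] == 5]
    # Share of year-one bottom-quintile teachers who land in the top two quintiles in year two.
    jumped = sum(q2[i] >= 4 for i in bottom) / len(bottom)
    # Share of year-one top-quintile teachers who stay in the top quintile.
    stayed_top = sum(q2[i] == 5 for i in top) / len(top)
    return jumped, stayed_top

jumped, stayed_top = simulate()
print(f"bottom quintile -> top two quintiles: {jumped:.0%}")
print(f"top quintile -> top quintile again:   {stayed_top:.0%}")
```

In this toy model, no teacher's true effect changes between years, so every quintile jump comes from noise. Raising `noise_sd` increases the jumps, and setting it to zero removes them.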

The Annenberg Institute for School Reform at Brown University, which has a history of criticizing the Bloomberg administration, published Corcoran’s findings, which were part of a wider look at the practice of assigning “value-added” scores to teachers based on their students’ test scores.

The analysis explains the difference between what value-added scores of teachers aim to do and what value-added measurements actually do in practice. The dream is to isolate the effect of a teacher on students’ performance from the effect of everything else; the reality is that the measures approximate that isolated effect with statistics, weak tests, and small sample sizes.

Corcoran offers some praise. “The simple fact that teachers and principals are receiving regular and timely feedback on their students’ achievement is an accomplishment in and of itself, and it is hard to argue that stimulating conversation around improving student achievement is not a positive thing,” he writes. “But,” he writes,

teachers, policymakers, and school leaders should not be seduced by the elegant simplicity of “value-added.”

The weaknesses of value-added detailed in the report include:

  • the fact that value-added scores are inherently relative, grading teachers on a curve — and thereby rendering the goal of having only high value-added teachers “a technical impossibility,” as Corcoran writes
  • the interference of imperfect state tests, which, when swapped with other assessments, can make a teacher who had looked stellar suddenly look subpar
  • and the challenge of truly eliminating the influence of everything else that happens in a school and a classroom from that “unique contribution” by the teacher
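The first weakness, grading on a curve, can be illustrated with a small sketch. The scores below are made up, and `bottom_quintile` is a hypothetical helper, not anything from the report. The point is that when rankings are relative, the bottom fifth stays the bottom fifth even if every teacher improves by the same amount.

```python
def bottom_quintile(scores):
    """Return the indices of the teachers whose scores rank in the bottom fifth."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    return set(order[: n // 5])

# Made-up scores for ten teachers.
scores = [52, 61, 48, 70, 66, 58, 73, 45, 64, 55]
# Every teacher improves by the same amount.
improved = [s + 20 for s in scores]

print(bottom_quintile(scores) == bottom_quintile(improved))
```

However much the whole group improves, a relative ranking still places one in five teachers in the bottom quintile.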

Another challenge for the teachers and principals charged with using value-added scores for self-improvement is the uncertainty about what each individual teacher’s score actually is. On each teacher’s report, the city pinpoints the percentile ranking that represents how she compares to other teachers of the same subject and grade.

But while this is the ranking that the teacher most likely holds, it’s far from 100 percent certain. Indeed, the economists who calculate value-added scores can say with confidence only that the teacher falls somewhere within a range of percentiles (and even with that caution, they’re still only 95 percent certain). This range, as you might remember from statistics, is called the “confidence interval.”

For most teachers, the confidence interval is at least 30 percentage points long. For math and English teachers with only one year’s worth of data, the average length is over 60 percentage points. That’s a range of, for instance, between the 10th and 70th percentile of teachers.

The average confidence intervals that Corcoran reports are in the chart below. You can see that, because the confidence intervals shrink as the sample size grows, they are longest when only a year’s worth of data is available.
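The link between sample size and interval width can be sketched with the textbook formula for a 95 percent confidence interval around a mean, whose width shrinks in proportion to the square root of the number of observations. This is a simplification on the raw score scale, not the percentile-scale calculation behind the Teacher Data Reports, and the class size of 25 and standard deviation of 1.0 are assumed values.

```python
import math

def ci_width(sd, n, z=1.96):
    """Width of an approximate 95% confidence interval for the mean of n observations."""
    return 2 * z * sd / math.sqrt(n)

# Assumed: one class of 25 students per year, score spread (sd) of 1.0.
one_year = ci_width(sd=1.0, n=25)
three_years = ci_width(sd=1.0, n=75)

# Tripling the data narrows the interval by a factor of sqrt(3).
ratio = one_year / three_years
print(f"one year: {one_year:.2f}, three years: {three_years:.2f}, ratio: {ratio:.2f}")
```

Under these assumptions, three years of data narrow the interval to a bit under 60 percent of its one-year width. The interval shrinks with more data but does not vanish.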

Teachers in the Bronx face the least certainty. Corcoran guesses that this is because their students are the most likely to go unmeasured, which shrinks the data pool: either the students are classified as special ed or English language learners and don’t take the state test, or they move from year to year, making data about their growth over time harder to come by.

[Chart: average confidence intervals for teachers’ value-added percentile rankings]

The full report is here and below.
The Use of Value-Added Measures of Teacher Effectiveness in Policy and Practice

  • Michael M.

    This should surprise NO ONE. Neither should the likelihood Klein will continue to press on with a flawed system.

    If I’m reading the first graphic correctly, in addition to the wild swing anomalies noted in the text, there’s no greater than a 35% chance a teacher in a given quintile will repeat in the same quintile. The average (by eye) odds of a repeat is only about 20%.

    And, missing from the bullet list of weaknesses, imho:
    * The challenge of eliminating everything that happens OUTSIDE the school.
    * The challenge of “carry-over”: an inspirational teacher’s positive impression on a child might provide some momentum into the following year, and vice versa.
    * Chemistry. I’d guess that there are some consensus terrific teachers, and some consensus to-be-avoided teachers. But for the great majority, teacher-kid chemistry matters, as does the MIX of kids in a class as set by the school administration.

    I am also struck by the confidence intervals, in that the order of boros is the same for both ELA and Math. Given how erratic the other results are, this is by contrast a seeming steady anomaly.

    P.S. Parent rankings of teachers would be much more stable — and much more trustworthy, imho. As usual, parents are the missing value-added variable in NYC.

  • Peter

    * As more longitudinal data is collected, the reliability of the data will also increase.

    * To use the data for 20% of a teacher’s evaluation for the 11-12 school year presents the State with a serious issue: any rating based on the data could be open to serious criticism.

    * In NYC the 10-11 data will be used in the eleven “transformation” schools.

    * Looking at the qualities of the teachers in the top and bottom quintiles might be the most useful application of the data

  • Floyd1976

    So, with value added, even with 3 years’ data, the best one can hope for is a 30%-ish range? Are you serious? So, if we are starting to use this to judge teachers, will people be alright with me giving my students tests that could be as much as 60% off in assessing their mastery, but, by the end of the year, it will be 30%? That would mean that a student in my class that got a 100% could be at the same level as someone who had a 70%. No student or parent would accept this method of assessment.
    I do not understand why so much money is being put into something that is so consistently being proven to be so insanely inaccurate.

  • Michael M.

    Value-added is like subliminal advertising in the ’60s and ’70s: it may not have sold soda, but to the ad agencies, it sold ad campaigns.

  • Bronx teacher-lady

    The results of this study need to be given SERIOUS ATTENTION. Teachers are already being denied tenure, and soon many will be evaluated on value-added data.

  • Matthew Ladner

    The first question to ask is just how much stability one should expect. Having said that, stringing together 3 end-of-year state exams is simply an evolutionary step in doing value-added assessment, not the end goal. There are schools doing this in a far more productive fashion: giving monthly exams of common assessment items developed by the teachers themselves based upon state standards. More data = less error, and the teachers own the process.
    http://jaypgreene.com/2010/09/02/nyt-on-la-times-value-added-bombshell/

  • Michael M.

    ML,

    Let me flip your rhetorical point: If there is such little stability, why so much credence?

    Taking a kid’s temperature more frequently — with a busted thermometer — won’t get him healthier faster.

    Your prescription — more frequent use of flawed student tests used to torture the data, if not the teachers — confuses precision (illusory), accuracy (none, given poor repeatability), and error (plenty).

    No rock star pay for gar(b)age band testing. Sheesh.

  • Matthew Ladner

    Michael M-
    Who said anything about using state tests? The state of the art on this is teachers working together in professional learning communities at the department level and developing agreed upon assessment items based on state standards. You can develop a rolling thunder value added data system that the teachers own themselves.

  • Peter

    Matthew:

    Periodic assessments are not a new invention: the beginning-of-the-period quiz, the Friday exam, the end-of-unit exam, homework, and student responses during each lesson are all assessments. As a teacher I ask myself: why wasn’t my lesson successful with a cohort of students? What can I do differently? And, the frustration, what has the student not done?

    Student attendance correlates highly with student success, student homework correlates with achievement.

    If we disaggregate students by attendance, homework and behavior, no surprise … it is these non-cognitive skills that presage test scores.

    I met with a group of middle school teachers, their Data Report scores, 1. varied significantly from year to year, 2. teachers who taught higher achieving students had significantly higher scores than teachers teaching lower achieving classes. (i.e., “integrated” gen/spec ed classes, etc.)

    Yes, we should focus on outliers, for the teachers within two standard deviations of the mean, the data is not useful.

  • Michael M.

    Point well taken, and I’ll assume you got mine, your caveat that I missed the fork in the road notwithstanding.

    You probably caught the recent NYTimes article on “how people learn,” etc., suggesting the value of pop quizzes, etc. as a memory retention aid.

    But I dare say your teachers’ circle is not using their in-department in-school student quizzes to determine which of their own circle should face the firing squad. Would not a teacher feel conflicted if the grade he/she gave a student put her OWN neck on the line?

    And to split a fine hair, there’s a difference between “state of the art” and “experimental.” The former implies proven value-added, if I may. The latter acknowledges, that such is as yet unproven. I’m all for both, actually, as long as they’re each appropriately advertised.

  • Matthew Ladner

    Peter-you are entirely correct that quizzes and formative assessments are old ideas, but bread and meat were old ideas when the Earl of Sandwich had the lightbulb go off in his head as well. If schools develop their common assessment items and use a data system to examine them over time, it becomes easier to detect and attempt to remediate problems. Data goes up, error falls, teachers own the process, students learn more.

  • Peter

    Matthew:

    Do they “learn more”? Is there evidence that schools that are heavy users of data are more successful schools? From what I understand “high click” schools (using ARIS) have not shown significantly better results.

    Did u see the timely op ed in today’s NYTimes?

    http://www.nytimes.com/2010/09/20/opinion/20engel.html?_r=1&ref=opinion

  • Smith

    Should I hold my breath waiting for the newspapers to report on this study?

  • Matthew Ladner

    I can’t help but think when I read someone like the NYT writer waxing poetic about nuanced evaluations of learning that around a third of American 4th graders are functionally illiterate and that American Seniors rank near the bottom in international exams. Despite this, we get treated to regular sermons about how standardized tests don’t really capture the true genius of today’s young American scholar. Oi vey.

  • Michael M.

    … and yet too many testaholics are FINE with the randomness of test outcomes, the randomness of value-added outcomes, and the myopic focus of both.

    And it’s Oy. Which cost you a “4”. It’s sometimes as fickle as that.
