GothamSchools — daily independent reporting on NYC public schools

skoolboy
Aaron Pallas

New York vs. New York

Earlier this year, Brooklyn Assemblyman James Brennan issued a report arguing that the educational reforms formulated by Mayor Michael Bloomberg and Chancellor Joel Klein were not as influential as the policies and practices that preceded mayoral control.  Although test scores have risen significantly over the past decade, the report claimed, most of the improvements predated mayoral control.  Moreover, in some cases, the progress during the Bloomberg-Klein era has been slower than what was observed before.  The report’s conclusions call into question claims about the impact of mayoral control, a hot topic for the next few months and perhaps beyond.

Andy Jacob, overworked and underpaid Department of Education spokesperson–I’m actually not kidding about this–countered with a memo seeking to refute Brennan’s report.  The key claim:  The test score gap between New York City and the rest of the state has closed more under mayoral control than it did before mayoral control, by a substantial margin.  The source of the evidence?  State tests in reading and math in both the fourth and eighth grades.

New York is a big and diverse state, and it’s not immediately obvious why comparing New York City to the state is a good idea.  One rationale is that this comparison controls for year-to-year variation in the content or difficulty of the state exams.  Thus, if the average test score gap between city and state were to shrink over time, we could discount the explanation that it’s just a function of the properties of the test.

As usual, skoolboy prefers to look at scale scores rather than the percentage of students who are proficient; the scale scores carry more information about student performance, and are less vulnerable to the distortion that high-stakes accountability can elicit.  I looked at scale scores in fourth- and eighth-grade reading and math from 1999 to 2008, comparing the rate of improvement in New York City to growth across New York state overall.  In particular, I examined whether the extent to which the gap between city and state closed differed between 1999 to 2003, prior to mayoral control, and 2003 to 2008, during the Bloomberg/Klein era.
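For readers who want to see the arithmetic behind a "percent of the gap closed" figure, here is a minimal sketch in Python.  The function name and every scale score in it are hypothetical, chosen only to make the calculation easy to follow; they are not the actual city or state averages.

```python
def gap_closed_pct(city_start, state_start, city_end, state_end):
    """Percent of the starting city/state gap that had closed by the end year.

    A positive result means the gap shrank; a negative one means it grew.
    """
    gap_start = state_start - city_start
    gap_end = state_end - city_end
    return 100.0 * (gap_start - gap_end) / gap_start


# Hypothetical average scale scores, for illustration only.
# Period 1: the gap goes from 20 points to 16 points.
period_1 = gap_closed_pct(city_start=630.0, state_start=650.0,
                          city_end=644.0, state_end=660.0)

# Period 2: the gap goes from 16 points to 12 points.
period_2 = gap_closed_pct(city_start=644.0, state_start=660.0,
                          city_end=652.0, state_end=664.0)

print(f"Period 1: {period_1:.0f}% of the gap closed")
print(f"Period 2: {period_2:.0f}% of the gap closed")
```

Note that each period is measured against the gap at its own starting year, so the same four-point reduction counts for more in the second period, when the starting gap is smaller.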

The picture is not as clean as Andy Jacob suggests.  The first figure below shows trends in performance in 4th grade and 8th grade reading.  The vertical line separates 1999-2003 from 2003-2008.  In 4th-grade reading, the gap between New York City and New York state closed by 25% during 1999 to 2003, but only 10% between 2003 and 2008–a slower rate of closing the gap during the Bloomberg/Klein era.  In 8th-grade reading, the story is worse:  the gap between city and state increased by 2% between 1999 and 2003, but increased by an additional 16% between 2003 and 2008.  Nothing for the DOE to crow about here!

The second figure shows trends in 4th grade and 8th grade math performance.  Here, the DOE claims hold up.  In 4th-grade math comparisons between city and state, the gap shrank by 43% between 1999 and 2003, and by 57% between 2003 and 2008, a higher rate of closure during the mayoral control era than before.  Similarly, the city/state achievement gap in 8th-grade mathematics closed at a faster rate between 2003 and 2008 than it did between 1999 and 2003.

Unfortunately for Andy, he didn’t look at the National Assessment of Educational Progress data.  Since 2002, New York City has been participating in the Trial Urban District Assessment (TUDA), which generates NAEP scale scores for New York City and a number of other large urban districts that can be compared to the New York state NAEP scores.

In 2003–relatively early in the Bloomberg/Klein run–fourth-graders in New York City scored 12.3 points lower, on average, on the NAEP reading assessment, and 9.6 points lower, on average, on the NAEP math assessment, than fourth-graders in New York State overall.  Although these gaps shrank by two or three points between 2003 and 2007, the shrinkage is not statistically significant.  In other words, there is no persuasive evidence that the reading and math performance gap between fourth-graders in New York City and fourth-graders in New York state declined between 2003 and 2007, prime-time years for the Bloomberg/Klein reforms.

The picture is much the same at the eighth-grade level.  New York City eighth-graders started out in 2003 13.5 points behind their state peers in reading, and 14 points behind in math.  In 2007, the reading gap had actually increased by a point, and the math gap had shrunk by three points.  Once again, though, these changes between 2003 and 2007 were not statistically significant.  We cannot conclude that the reading and math performance gaps between New York City eighth-graders and eighth-graders across the state declined between 2003 and 2007.

Bottom line:  On both the New York state reading test and the NAEP reading assessment, there is no evidence that the achievement gap between New York City and New York state overall closed faster during the Bloomberg/Klein era of mayoral control than the period preceding mayoral control.  In mathematics, the city closed the gap faster under mayoral control than in the preceding period on the New York state math assessment, but the performance gap on the NAEP assessment has not shrunk appreciably during the Bloomberg/Klein administration.

17 Comments


  1. eduwonkette

    Great post, SB. It still amazes me how misleading proficiency rates can be, and what you have here is a stellar example of that principle.

    I am also surprised that given the 57% gap closing for 4th grade math between state and NYC from 2003-2008 that we don’t see a similar closing on the NAEP. I can see plenty of reasons why they may not match up perfectly - but these are two *very* different stories, and does make one concerned that these skills aren’t transferring to other tests.

    And finally, an idea - you might also do a similar comparison between NYC TUDA and the other cities. Klein often says that the NAEP is less reliable because it is a small sample, but it is a small sample across the board and the fact is that cities like Atlanta have been making substantial progress while we haven’t.

  2. And what about the in-town Achievement Gap between Whites and Asians; and Blacks, Hispanics, Special Ed, and ELL… W-I-T-H-I-N the city?

    Widening or closing over similar periods?

    Note: CECD2 has disaggregated Performance Index (PI) data supporting the above groupings, for D2 only. In D2, income as a factor appeared to be a wash. See link for the CECD2 resolution on the general topic.

  3. skoolboy

    Michael M.,

    You can see my analysis of whether NYC was closing the achievement gap among racial/ethnic groups here: http://www.nysun.com/files/pallasmemo.pdf

  4. Michael M.

    Rock on SB! Rock on Eduwonkette!

    Time for a spinoff of SB’s September ’08 classic: “Could a Monkey Do a Better Job of Predicting Which Schools Show Student Progress in English Skills than the New York City Department of Education?”

    Hint: “Score: Monkey 6, DOE 0.”
    (http COLON //blogs DOT edweek DOT org/edweek/eduwonkette/2008/09/could_a_monkey_do_a_better_job DOT html)

  5. SB,

    Thanks. Will share with CECD2.

    Adding to my comment above re CECD2 looking at PI as one metric by which to measure the Achievement Gap, we recognized that the formula itself for calculating PI has a bias: it treats 3’s and 4’s the same, which loosely put, for D2, is more likely to “cap” the performance of the Whites on ELA and Math, and Asians on Math than other groups on either measure.

    Given such a “3 is good enough” formula, I assert the PI understates what may more precisely be called the Proficiency Gap. It’s more like an under-performance index. (And personally, I am concerned it may result in classrooms that neglect 3’s by putting the emphasis on raising a 1 to a 2, or a 2 to a 3. Raising a 3 to a 4 simply does not boost PI.)

    I recognize your Aug 08 Memo used scale scores (which I believe have no such high-end compression). I now wonder how PI — used by NYS, and by the city for NCLB reporting, and readily available for every school on its DOE website — would compare to your analysis in any given year, as well as trend over time.
    (Click my name for link to CECD2 Resolution on topic.)

  6. David Cantor, NYC Department of Education Press Secretary

    Aaron Pallas’s analysis is flawed.

    Disaggregating NYC from Rest of State
    Pallas compares average New York City scale scores with average New York State scale scores instead of comparing NYC scores to the rest of NYS scores (NYS - NYC). Given that New York City is such a large percentage of the state population, state averages can suppress real differences in gains between NYC and other areas of the state. DOE’s analysis compared NYC to the rest of the state, which is why Pallas’s analysis did not match Andy Jacob’s reports. NAEP data is not available disaggregating NYC vs. the rest of NYS, which is why NYC compares its performance to other Large Cities.

    Proficiency levels vs. scale scores
    Beginning in 2006 the New York State Education Department expanded the ELA and mathematics testing programs to all grades 3-8. At this time the state also re-scaled the grade 4 and 8 tests, changing the scale scores and their corresponding ranges. For example, as a result of these changes, the interval between a score of 650 and 660 in 2003 is not equal to the interval between a score of 650 and 660 in 2008. These changes make it necessary to use standardized scores (z-scores) in order to make scale score comparisons prior to and after 2006.

    NCLB and state accountability systems define assessment performance by proficiency levels. Therefore, NYS state assessments are designed to determine whether students are meeting proficiency. Test items are selected for this purpose. In particular, the state tests include many more items that assess whether students are meeting standards than whether they are exceeding them. (For example, in most grade levels, a student would only have to miss one or two questions to be considered a Level 3. For more information, please refer to the NYS Department of Education’s assessment technical reports.) For this reason, NYS and NYC report their achievement results in terms of the percentage of students at proficiency. (Average scores are more appropriate for tests that are designed to measure achievement equally across all levels.)

    Bias
    Pallas’s entire comparison, in using 2003 as opposed to 2002 as a baseline for measuring achievement trends, is premised on a non-statistical, political judgment. Obviously what happened before this administration influenced this administration’s results, just as we will influence those who follow us, but the usual measurement of an administration’s performance looks at what happened while the administration was in office: in this case, from Sept 02-today. Setting a baseline that forecloses on the administration having any effect on performance during a year it ran the school system isn’t a function of serious analysis; it’s advocacy.

  7. Diane Ravitch

    David Cantor is wrong to claim that 2002 is the correct baseline for analyzing test score trends. The DOE would like to use 2002 as the baseline because the test scores in the spring of 2003 showed a dramatic increase. But these increases registered in spring 2003 occurred before the introduction of any of the Children First reforms. Consider the timeline: The Mayor announces the Children First reforms in January 2003, as the children are taking their tests. The reforms are implemented in September 2003. And now Cantor would like to claim credit for the big increases that occurred before any of the reforms were introduced! Here is a quote from the New York Times (May 21, 2003) describing the response of DOE leaders to the “sharp gains” on the ELA test: “City officials, who might otherwise have been jubilant about yesterday’s results, offered a muted reaction, saying that the gains were not broad enough and that the school system as a whole was still failing at least half the city’s children.” And here from the Times’ story of October 22, 2003 about the math gains: “Fourth graders across the state made stunning gains in their math scores last spring, with even sharper increases in New York City…In the city, news of the gains…elicited cheers among teachers and principals. But not everyone greeted the news so enthusiastically. The suggestion that city schools are on the upswing put Chancellor Joel I. Klein, who is overhauling them, in a tricky position. While the chancellor’s critics pounced upon the higher scores as evidence that the school system did not need such an overhaul, some of his allies acknowledged that he would now be under even more pressure to show gains next spring. Mr. Klein’s reaction to the good news was muted, as it was to news of higher reading scores in the spring.” It is interesting that the gains of 2003, preceding the launch of the Children First reforms, were the only ones to be confirmed by the NAEP tests, as NAEP showed achievement in NYC to be flat from 2003-2007. No wonder David Cantor and Chancellor Klein are now trying to claim credit for the scores recorded before their reforms were implemented.

  8. First, what a terrific blog, and terrific string! Thank you all.

    In layman, non-wonk, terms, I am struck by the following:

    1) The population of NYC may be 8M, and the population of NYS may be 19M. But if improvement in NYC pushes the carrot farther out the state stick, for the reading gap to stay relatively constant in recent years implies Klein is doing no better than the average Chancellor-equivalent elsewhere, and implicitly, Klein is more than “gaining” on his peers in math.

    2) I get the 2002 vs. 2003 topic, but what of 2004 to present? There is no apparent closing of the reading gap, and in math it’s the same improvement “slope” recently as in the pre-Klein past. (I won’t belabor all the changes in DOE since 2004, but they simply don’t show an impact in the trends.)

    3) Peer review is a healthy part of not only science, but public policy too. I call on DOE to open its books in full to such outside independent review, such as that “advocated” on an albeit different topic by Comptroller Thompson in his May 2008 report on DOE growth planning, or the lack thereof (see link to “Growing Pains”). Until then, it’s hard to see the statements of DOE itself as anything OTHER than “advocacy.”

    4) Like most parents, I just want the straight poop. Personally, I can no longer trust DOE to issue anything but self-congratulatory reviews. I am an advocate for the kids and have no vested interest in any “side” but theirs.

  9. eduwonkette

    David,
    I don’t understand how your critique of scale scores can square with DOE’s own progress report system, which chops up scale scores into faux proficiency units (i.e. 3.1, 3.2, 3.3, etc). DOE has argued many times that these distinctions - which are derived from scale scores, including scale scores above the cut score for proficient - are meaningful, even going as far as to say that a student who scored a “3.3” last year and a “3.1” this year did not make a full year of progress. How can scale scores be useful for the progress reports and not useful for comparing NYC’s progress with the state?

  10. Socrates

    Cantor appears to be correct on the scale score issue, though eduwonkette is right to point out a potential inconsistency. Pallas’ skill in interpreting data has always been disappointing to me, precisely because his errors always skew in support of his ideological bias. As Cantor asserts, skoolboy seems to practice advocacy more than analysis.

    Cantor, however, doesn’t do himself any favors with this “the chancellor took over in 2002” nonsense. Ravitch’s quotes prove that the chancellor (or his people, at the least) agreed at some point with what common sense tells the rest of us: A chancellor can’t implement reforms at the end of the ’02-03 school year and then claim credit for the scores during that year. If scores were so sensitive to a chancellor’s reforms (in a system with 1 million+ students, no less) that in just a matter of months (or less) the scores rose that dramatically, we should have seen even more dramatic jumps in each subsequent year, when the reforms had really started to sink in.

    I actually like most of Klein’s reforms, and I think they will pay off once they’ve really started to sink in. It’s understandable that in our impatient political climate the folks at Tweed would feel the need to prove their reforms work, but it’s absurd to think that big systemic changes like the ones they’ve made would turn around a city’s schools in 8 years, let alone half a year. Cantor does his boss no favors by clinging to the absurd 2002 starting point.

  11. eduwonkette

    Hey Socrates, I really have to disagree with you on this scale score point, as most psychometricians and testing researchers advocate for using scale scores rather than proficiency ratings to assess educational progress (see, for example, Andrew Ho, Dan Koretz, Bob Linn, Paul Barton at ETS - Andrew Ho has a great article that clearly lays out the problem with proficiency rates - if you email me, I’ll send it to you.)

    Also, remember that this is a comparison of NYC with NY State’s scale scores. If the test cannot capture different levels of skill well above the cut score for a 3, then one would think that this analysis would favor NYC as there are many parts of the state that are more likely to have students close to the ceiling. And furthermore, the NAEP *is* designed to capture a wide range of achievement and has always been tracked longitudinally using scale scores; when you compare NYC with the state or even other cities in the NAEP Trial Urban District Assessment as Cantor would prefer, there are cities that have made substantial progress since 2003 while NYC has been stagnant.

    In general, what I find frustrating about the way that the DOE responds to criticism is that they argue “the analysis is flawed” without ever presenting analyses that demonstrate that the results would be different if, for example, you compared NYC with NY state exclusive of NYC, or if you used standardized score (z scores) instead of continuous scale scores. The burden is on the critic to pony up, but consistently DOE has taken the intellectually lazy position of personally attacking critics.

  12. Greg

    Eduwonkette’s position is fascinating: “The burden is on the critic to pony up.” Really? Then what do SkoolBoy and Eduwonkette propose we do to improve public schools in the city? Go back to a system with a chancellor every 2 years and an unaccountable school board, 32 local sups, and 2% voting rates for local school boards? How would we even start to measure that system! The fact that we can actually look at the Chancellor’s reforms (be they 2002 or 2003) as a longitudinal set of reforms is in and of itself a victory.

    Seriously, what do Aaron and Jennifer propose to improve the schools given limited financial and human talent resources? I’m sure their statistical analysis is excellent, but what has it shown them in terms of policy proposals that will make real change for real kids? Again, if “the burden is on the critic to pony up,” then what do you suggest we do instead of Children First?

  13. Michael M.

    Eduwonkette notes the ad hominem nature of DOE rebuttals. And another commenter piles on. Sheesh.

  14. Aaron Pallas

    I had hoped to pull together some new analyses before responding to David Cantor, but it’s taking a bit longer than I expected, and, not surprisingly, blogs abhor a vacuum. So here’s an interim response.

    Cantor is correct that the changes that were introduced into the state’s 4th grade and 8th grade tests in 2006 create a discontinuity in the tests administered before and after 2006. Until 2006, the 4th grade tests included some content that was at the 2nd and 3rd grade levels. With the introduction of state tests in grades 3, 5, 6 and 7 in 2006, the 4th grade test became more difficult. Analyses indicate that the 8th grade test did not become more difficult in 2006. It’s true that there was no direct equating of the 2006 and 2005 assessments, but the scores were linked through an equipercentile linking procedure. Because the means and standard deviations of the 2005 tests and the 2006 tests on the 2005 scale were virtually identical, I don’t think that the discontinuity has distorted the NYC-NYS comparisons. I’m more worried about variability from year to year in the standard deviation of the scale scores, which I suspect stems from the use of number-right scoring to derive the scale scores. But I’ll concede the point, and will seek to address it in my next analysis of these data.

    As for the separation of New York City from New York State, Mr. Cantor should be careful what he wishes for. Since New York City scores are lower than New York State scores overall (for every grade level, at every time point), a little bit of algebra will show that the New York State scores with NYC removed will be even higher than the NYC scores at every time point. And a little bit more algebra will show that, unless the fraction of New York State 4th and 8th graders from NYC has changed substantially over time, this will result in a smaller shrinkage over time in the gap between NYC and the rest of the state. I’ll be addressing this too.

    As for the claim that my analysis using 2003 as a baseline is premised on a “non-statistical, political judgment,” let me clarify my reasoning. This choice is certainly non-statistical—I don’t know what a statistical judgment about a baseline for an impact assessment would consist of. Is the judgment political? I don’t think so, if by this Cantor means that my political values are dictating this analytic choice. Rather, I contend that the choice is rooted in my professional judgment as someone who has been studying the impact of school reform policies and practices for more than 20 years. Over this period, I have come to believe that understanding the impact of reform initiatives depends heavily on two things: (a) knowledge about the timing and extent of the implementation of the reforms, and (b) a plausible theory of how the reforms might be expected to produce the observed outcomes.

    Diane Ravitch and my intellectual compadre Socrates have already made arguments about the first of these above—whatever the reforms that Joel Klein brought to New York City, there was not enough time for them to be implemented effectively between his start and the mid-year administration of the standardized tests at issue here. But I’d like to take things one step further, and examine the content of the 2002-03 reforms. In a spirited debate with Sol Stern played out on the eduwonk blog last summer, Deputy Chancellor Chris Cerf described these reforms as follows:

    “The changes they instituted include creating the Office of School Safety; the appointment of a new senior staff; a pay-for-performance system for all 40 community and high school superintendents; and the appointment of two community superintendents in predominantly African-American communities who generated significantly improved results. And while I’m not sure how to parse the effect on student achievement, the Mayor and Chancellor introduced an idiom of high expectations and accountability to discussions of student learning.”

    If there is existing evidence that reforms of this sort have been found to produce relatively immediate, sharp upswings in students’ performance on standardized tests, I would hope that David Cantor would point me to it. In the absence of such evidence, I’ll stick with my professional judgment—not statistical, but also not political—that 2003 is an appropriate baseline for evaluating the impact of the school reform policies and practices introduced by Mike Bloomberg and Joel Klein.

    A final comment for Mr. Cantor: stating that an analysis is flawed is not the same as demonstrating that the conclusions of the analysis are incorrect. Researchers make analytic choices all the time, and I can’t think of a piece of educational or social research that I would characterize as being without flaws. What’s at issue is whether other analytic choices that a scholarly community deems to be as good or better lead to different conclusions. The DOE response would be more powerful if it demonstrated that other ways of addressing issues such as the 2006 changes in the state tests or the comparison of NYC with the rest of New York State led to substantially different results.

  15. If it’s indeed true that the 4th and 8th grade NY state tests changed substantially in 2006, this is yet another reason to rely on the NAEPs to give us a more reliable evaluation of progress, since the NAEPs have been stable over time. And we know what the NAEPs show in terms of NYC’s lack of progress since 2003.

  16. Re Deputy Chancellor Cerf’s (per Aaron Pallas), “…a pay-for-performance system for all 40 community and high school superintendents…”
    Please see CECD2’s Policy Analysis on the role of the Superintendent (see link).

    Excerpt: “As is the case in all districts throughout the Department of Education, our Superintendent has been deployed to schools outside of her home district — District 2 — to serve as Senior Achievement Facilitator (“SAF”). Upwards of 90% of the Superintendent’s time is spent on her SAF duties, coaching inquiry teams outside of our district on how to use data-driven instruction to improve instruction, student achievement, and test scores, leaving little time to fulfill her duties in our district.”

    So I ask: based on what performance is any superintendent’s pay based — software trainer?

    The Chancellor has long tried to abolish District Superintendents. I believe this has been challenged in court. As I understand, the plaintiffs won. But it was a Pyrhhic victory — Klein has simply assigned them other duties 90% of the time, and effectively replaced them with School Support Organizations that he forces the Principals to pay for.

  17. Oop. “Pyrrhic.” Darn public school education. ;-)
