GothamSchools — daily independent reporting on NYC public schools

guest perspective

Guessing My Way to Promotion

Last week I read a thought-provoking column by Diane Ravitch in the New York Post, in which she discusses the lowering of the bar on New York State math and ELA tests. She points out that to reach level 2, which is sufficient for promotion in New York City, a student needs a significantly lower percentage of points than he or she would have needed three years ago. Ravitch comments: “Ending social promotion, as the city rightly wants to do, is thus meaningless, because students can reach Level 2 by just guessing.”

Likewise, Meredith Kolodner writes in the Daily News, “The number of correct answers needed to score a Level 2 to get promoted has sunk so low that a student can guess on the multiple choice section and leave the rest of the test blank.”

This is disturbing. Surely it isn’t possible to get a 2—and thus a promotion to the next grade—by just guessing! Or is it?

To find out, I conducted a little experiment. First, some background facts:

(a) Each question on each of the tests is worth a certain number of points. The total number of points earned on a given test is the raw score.
(b) Each test has its own conversion table for converting the raw score to a scale score.
(c) The conversion from scale score to proficiency level is different for each grade and subject (though 650 is the minimum for a level 3 across the board).
(d) Thus, to find out if a student got a 2 on a test, you have to (1) correct the test, (2) calculate the raw score, (3) convert it to scale score, and then (4) convert the scale score to proficiency level.

The New York State Education Department website has all the tests, scoring keys, and tables you need.

Now follow along with me as I reproduce the experiment. My question was: is it possible to get a 2 by just guessing?

I first tried my experiment with the sixth grade ELA test. I “guessed” all the answers on the multiple-choice portion and left the written portions blank. Or, rather, I didn’t “guess,” but filled in the answers as follows: A, B, C, D, A, B, C, D, and so on, all the way through the 26 questions. I didn’t read one of them.

Now, of course I got a zero on the written portions, but let’s see if I got enough points on the multiple-choice questions alone. To find out, I first consulted the scoring key. The total number of possible raw points is 39: 26 for the multiple-choice questions, and 13 for the written portions I scored myself. Remember that I answered A, B, C, D, A, B, C, D, etc. According to the key, I earned 12 points.
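For readers who would rather automate the scoring step, here is a minimal sketch in Python. It builds the A, B, C, D answer sheet for 26 questions and counts matches against an answer key. The key below is a made-up placeholder, not the real 2009 key (that one is on the State Education Department site), so the count it prints is illustrative only. The function names are invented for this sketch.

```python
from itertools import cycle, islice


def pattern_answers(n_questions, pattern="ABCD"):
    """Build the repeating answer sheet: A, B, C, D, A, B, C, D, ..."""
    return list(islice(cycle(pattern), n_questions))


def raw_multiple_choice_points(answers, key):
    """Award one raw point for each answer that matches the key."""
    if len(answers) != len(key):
        raise ValueError("answer sheet and key must have the same length")
    return sum(1 for given, correct in zip(answers, key) if given == correct)


# HYPOTHETICAL 26-item key, invented for this sketch; it is NOT the 2009 key.
PLACEHOLDER_KEY = list("ACCDBBCAADCDABDDCBAACBCDAB")

sheet = pattern_answers(26)
points = raw_multiple_choice_points(sheet, PLACEHOLDER_KEY)
print(f"Pattern sheet earns {points} of {len(PLACEHOLDER_KEY)} multiple-choice points")
```

Swapping in the published key for a given grade and subject would reproduce the raw score for that test, provided each multiple-choice question is worth one point, as the 26-point total above implies.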

Now I went to the raw to scale score conversion chart to calculate my scale score (if you are following along, be sure to click on the “Grade 6” tab at the bottom of the chart). According to this chart, my scale score was 622.

From here I consulted the “Definitions of Performance Levels for the 2009 Grades 3-8 English Language Arts Tests.” According to this table, a sixth grader needs a scale score of 598-649 in order to attain level 2. My score fell within that range, so I got a 2 without looking at a single test question or writing a single word.

I thought to myself: What if the sixth grade ELA test were a fluke? I tried the same experiment with the seventh grade math test.

Again, I only worried about the multiple-choice questions. Actually I didn’t worry about them at all; I simply answered them A, B, C, D, A, B, C, D, as I had done with the sixth-grade ELA test. There were thirty such questions.

I then went to the scoring key. I scored 11 raw points. According to the raw score to scale score conversion chart, this gave me a scale score of 616 (be sure to scroll down to the 7th grade chart). Is that enough for a 2? I consulted the “Definitions of Performance Levels for the 2009 Grades 3-8 Mathematics Tests.” A seventh grader needs a scale score of 611-649 on the math test to get a 2. So I got a 2 without solving a single math problem, or even looking at one.
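The two lookups that follow the raw score (raw to scale, then scale to level) can be sketched as a small pipeline. The tables below hold only the figures quoted in this post: one raw-to-scale entry per test, the lower bound of level 2 for each, and 650 as the level 3 minimum. The full state charts have a row for every raw score, and the names used here are illustrative assumptions.

```python
# Only the values quoted in the article; the real charts cover every raw score.
RAW_TO_SCALE = {
    ("ELA", 6): {12: 622},
    ("Math", 7): {11: 616},
}

# Lowest scale score that earns a level 2, as quoted for each test.
LEVEL_2_MIN = {("ELA", 6): 598, ("Math", 7): 611}

# 650 is the minimum for level 3 on every grade and subject.
LEVEL_3_MIN = 650


def scale_score(subject, grade, raw):
    """Step 3: convert a raw score to a scale score using the chart."""
    chart = RAW_TO_SCALE[(subject, grade)]
    if raw not in chart:
        raise KeyError(f"raw score {raw} is not in this partial chart")
    return chart[raw]


def level_band(subject, grade, scale):
    """Step 4: convert a scale score to a coarse proficiency band."""
    if scale >= LEVEL_3_MIN:
        return "Level 3 or higher"
    if scale >= LEVEL_2_MIN[(subject, grade)]:
        return "Level 2"
    return "Level 1"


for subject, grade, raw in [("ELA", 6, 12), ("Math", 7, 11)]:
    scale = scale_score(subject, grade, raw)
    print(subject, grade, raw, scale, level_band(subject, grade, scale))
```

Both runs land in the level 2 band, matching the hand lookups above.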

While this approach does not result in a 2 for all the tests, it comes a bit too close for comfort, and another guessing system might work. A fifth grader told me that his father had told him, “Just mark ‘C’ for all of the answers, and you will pass.” On the fifth grade ELA test, this would indeed have resulted in a 2.

Yes, it is possible to guess your way to promotion. You may not even have to look at the questions or write a word on the written sections. It may not be called social promotion, but it amounts to the same thing: You do not need to know or understand much to move along.

Diana Senechal taught in NYC public schools for four years and has stepped back to write a book. She has a Ph.D. in Slavic Languages and Literatures from Yale; her education writing has appeared in Education Week, the Core Knowledge Blog, and Joanne Jacobs.

  • Pogue

    Diane, what a fantastic experiment. It is sad what Bloomberg and Klein are getting credit for while the newspaper editorial boards go right along with them. So, “We guess NYC is keeping it going,” and, “We guess Mayoral Control is working”, and, “We guess kids are actually learning skills for adulthood.” The current system is sad and disgraceful and that ain’t guessing.

  • Michael M.

    Excellent analysis. Thanks.

    Gawd ferbid the town’s parakeet cage liners pick up on this story.

    Does this mean that to Score Level 1, it had to be… intentional? Sort of a protest vote? (kidding)

    Seriously, it would seem that Level 1 kids need something more than a re-run; they need serious help — not election season soundbites. Or Klein appealing to “common sense” on FOX while he SITS ON the RAND study results. Sheesh.

  • http://MoreThoughtful.blogspot.com ceolaf

    This is brilliant.

    Nice job.

  • http://www.grand-rounds.blogspot.com Jennifer

    I had an immediate visceral response to this post but once I re-read it, I realized I was responding to the word “passing” used in conjunction with the state assessments. It’s a relationship that seems to occur exclusively in NYC. I assume that connection occurs as a consequence of the policy that you linked to, Diana (thanks for that), and it makes perfect sense to reference on a NYC-centric blog. That said, these are state-wide assessments written for all students in the state. In many areas, there is hesitancy to frame student performance as “pass/fail” as it implies that there are consequences for the child if they do not do well on the assessment – a consequence (social promotion) that, I think, only occurs in NYC. In fact, the testing guidance document directly states their opinion on the matter: “the State Education Department (SED) advises schools that decisions such as promotion or retention should be based on multiple measures of the student’s achievement and not solely on scores from the New York State Testing Program” (Introduction to the Grades 3–8 Testing Program in English Language Arts and Mathematics. NYSED, 2004).

    I think I was also responding to the potential message the posting might send a young adult or parent who reads the blog – or even finds it when Googling “how to pass the NYS tests”. There is so much noise already around the assessments and their legitimacy, I wonder about the benefits of stating “don’t take it seriously, you can just fake your way into success”. Especially when a Level 3 means mastery of the standards, not a Level 2.

    True, the system isn’t perfect, no one is claiming it is. I’ve ranted my fair share about the testing schedule and demands. Also, I’m pretty confident SED is aware of the challenges that have been mentioned. However, given the goal of using multiple measures to make the best decisions possible, I struggle with how blog posts, even those as well-written and logical as this one, contribute to getting us past the assessments or keeping us mired in them. Is it time for us to accept that the criticisms have been heard and move our assessment energy to authentic assessments and multiple measures?

  • http://charterschoolindependent.blogspot.com mathteacher

    As a point of reference: in Massachusetts on the 2009 7th grade math test, the numbers work out like this: There are 54 total points, 20 of which are open response and 5 of which are short answer. This leaves 29 multiple choice questions that I can reasonably guess on without looking at the questions. Assuming I would get about a fourth of the questions correct, that leaves me with 7 points, or thereabouts. To get “Needs Improvement,” the MA equivalent of level 2, one needs to get at minimum 26 points, a little under half. By comparison, you can score level 2 in NY with less than 25% of the total points. My personal opinion, BTW, is that the MA test is a more interesting test. You can find the released items here: (http://www.doe.mass.edu/mcas/2009/release/g7math.pdf)
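The arithmetic in this comparison can be checked in a few lines. The sketch below uses the Massachusetts figures from the comment above. For New York it borrows the sixth-grade ELA figures from elsewhere on this page (7 raw points needed out of 39), which is an assumption made for illustration, since the comment compares seventh-grade math tests. The variable names are invented.

```python
# Massachusetts 2009 grade 7 math, figures from the comment above.
ma_total_points = 54
ma_multiple_choice = 29
ma_points_needed = 26

# "About a fourth" of blind guesses are expected to be right.
ma_expected_from_guessing = ma_multiple_choice / 4
ma_share_needed = ma_points_needed / ma_total_points

# New York grade 6 ELA, figures quoted elsewhere on this page (assumed stand-in).
ny_total_points = 39
ny_points_needed = 7
ny_share_needed = ny_points_needed / ny_total_points

print(f"MA: guessing yields about {ma_expected_from_guessing} points; "
      f"{ma_points_needed} needed ({ma_share_needed:.0%} of the total)")
print(f"NY: {ny_points_needed} points needed ({ny_share_needed:.0%} of the total)")
```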

  • http://blog.coreknowledge.org/2009/08/18/social-promotion-easy-as-a-b-c/ Social Promotion? Easy as A, B, C… at The Core Knowledge Blog

    [...] to see if it’s possible to pass the test by simply guessing.  She posts the results over at Gotham Schools.  I first tried my experiment with the sixth grade ELA test. I “guessed” all the answers on [...]

  • canwetalk

    Could you imagine if pharmaceutical companies were to water down medicine that is used in life-threatening diseases and patients wondered why they are not recovering even though the medicine is touted as the best? BloomKlein, SED and Duncan in their unrelenting effort to promote themselves as miracle doctors are seriously affecting the total mental (a mind is a terrible thing to waste) health of the education system. It will soon flat line! DOA for the DOE!

  • Michael M.

    CWT,

    Or… they assume you’re getting healthier and infer the efficacy of the medicine has improved. Ergo, they believe the medicine is the opposite of watered down.

    I wish I were making it up. On a related GS article today, the state’s top test oversight guy (Howard Everson) argues, in effect, that… we BELIEVE the medicine is getting stronger (because the state asked that it be made so), therefore you MUST be getting healthier.

    There are so many dimensions to circular logic at play here, we should call it spherical logic.

  • canwetalk

    MM,

    Thank you for your insight on this issue. I read this the other day: “State math exam scores have risen – but it’s because tests have gotten easier”
    BY Meredith Kolodner and Rachel Monahan, DAILY NEWS STAFF WRITERS, Sunday, June 7th 2009. As per Jennifer Jennings, Columbia University doctorate student, “It’s the lesson of the financial crisis, and it’s the lesson here – you can’t just trust the numbers, you have to look at what the numbers mean,” Tisch said she thought rising tests scores were a reflection of improvement at city schools – but “as my grandmother would say, it’s nothing to write home about.” “Who are we kidding? And who’s being cheated? We tell the parent, your kid is a high level 3. A high level 3 on what? On nothing,” Although these test making publishers and educrats and their minions are in this circular logic with many dimensions, or I call it pi in the sky method, yet their deceits can be measured in volume, you and I know that the shortest distance between the truth is a straight line to the facts.

  • Anonymous

    In both cases, that pattern-based approach happened to produce considerably higher scores than would be expected for random guessing: 12/26 in Grade 6 ELA, versus an expected outcome of 6.5. And 11/30 on Grade 7 Math, versus an expected outcome of 7.5.

    But if you flip the subjects by grade, and apply that same “guessing algorithm” to Grade 6 Math and Grade 7 ELA, you get 4/25 (which results in the lowest possible scale score) and 6/26 (level 1) respectively. Those scores are both much more in line with what one would expect from random guessing, although, of course, if lots of students make random guesses then their resulting scores would be distributed in a normal curve. I.e. some kids who just fill in random patterns will indeed get lucky, and end up with higher scores than appropriate.

    Of course, more meaningful results are produced when students are not guessing randomly. A really low-performing student is reasonably likely to score worse than a student picking randomly, because the items feature distracters that are designed to appeal to students who have common misconceptions.
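The expected outcomes quoted in this comment follow from treating each blind answer as a one-in-four chance. A minimal sketch, assuming four answer choices per question (which is what 6.5 out of 26 and 7.5 out of 30 imply) and independent guesses, computes the expected score, its standard deviation, and how far each pattern result sits from expectation. The function names are invented.

```python
from math import sqrt


def guessing_stats(n_questions, n_choices=4):
    """Mean and standard deviation of the score from pure random guessing."""
    p = 1 / n_choices
    mean = n_questions * p
    sd = sqrt(n_questions * p * (1 - p))
    return mean, sd


def z_score(observed, n_questions, n_choices=4):
    """How many standard deviations the observed score is from the mean."""
    mean, sd = guessing_stats(n_questions, n_choices)
    return (observed - mean) / sd


# (score from the A, B, C, D pattern, number of multiple-choice questions)
pattern_results = {
    "Grade 6 ELA": (12, 26),
    "Grade 7 Math": (11, 30),
    "Grade 6 Math": (4, 25),
    "Grade 7 ELA": (6, 26),
}

for test, (score, n) in pattern_results.items():
    mean, _ = guessing_stats(n)
    print(f"{test}: got {score}, expected {mean}, z = {z_score(score, n):+.2f}")
```

A positive z means the pattern beat the guessing average on that test; a negative z means it fell short.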

  • Diana Senechal

    Thanks to all for the comments. Anonymous, you make a good point. As it happens, a student needs only 7 raw points on the sixth-grade ELA test to get a 2. I happened to get 12 points, but I didn’t need that many. There are 26 multiple-choice questions on the test, as you indicate, so a student need only get 27 percent of the multiple-choice questions correct (and nothing on the written part) to get a 2.
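How likely is a blind guesser to reach that 7-point threshold? A rough estimate comes from the binomial distribution. The sketch below assumes four answer choices and truly random, independent guesses on all 26 questions, which a fixed A, B, C, D pattern only approximates. The function name is invented.

```python
from math import comb


def prob_at_least(k, n, p=0.25):
    """Probability of k or more correct answers out of n random guesses."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


threshold = 7   # raw points needed for a level 2, per the comment above
questions = 26  # multiple-choice questions on the sixth-grade ELA test

print(f"Share of multiple-choice items needed: {threshold / questions:.0%}")
print(f"Chance of reaching it by guessing: {prob_at_least(threshold, questions):.1%}")
```

Because the threshold sits just above the guessing average of 6.5, the chance comes out close to even under these assumptions.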

  • http://MoreThoughtful.blogspot.com ceolaf

    Anonymous makes a great observation, though it does not diminish the larger point, which is that the design of these tests is rather flawed.

    It is easy enough to correct for guessing, so long as the answers truly do appear in a random order. Ms. Senechal has shown that the answers do not appear in a random order, or at least not for practical purposes.

    Apple faced this problem with its shuffle feature on its iPod. It turned out that truly random did not seem random to its customers because they did not want songs by the same artists to be played consecutively, even if random selection would result in that. In this case, there are certainly some departures from true randomness that any test designer should consider. The A-B-C-D-E pattern is one. I would suggest that there be an equal number of answers for each possible letter choice, rather than random distribution that might result in a few more of one than of others.

    But the bigger issue is the lack of a correction for guessing. It is easy enough to subtract a fraction of the incorrect answers from the score. That is, if there are 5 possible answers (e.g. A, B, C, D, E), subtract 1/4 of the wrong answers. This will result in the gains from sheer guessing being cancelled out by the losses from sheer guessing. (In this example, if someone guessed 25 times, they’d be expected to get 5 right and 20 wrong. Subtract 20/4 from those 5 and you get zero.) This standard sort of correction still preserves some advantage for eliminating some possible answers before guessing.

    Can someone explain why these tests do not correct for guessing?
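The correction described in this comment is often called formula scoring: the score is the number right minus the number wrong divided by one less than the number of choices, and blanks count for nothing. A minimal sketch, using exact fractions and invented function names, reproduces the 25-guess example. Nothing here describes how the New York tests are actually scored.

```python
from fractions import Fraction


def corrected_score(right, wrong, n_choices):
    """Right answers minus wrong / (n_choices - 1); blanks are ignored."""
    return Fraction(right) - Fraction(wrong, n_choices - 1)


def expected_corrected_from_guessing(n_guesses, n_choices):
    """Expected corrected score when every answer is a blind guess."""
    expected_right = Fraction(n_guesses, n_choices)
    expected_wrong = n_guesses - expected_right
    return expected_right - expected_wrong / (n_choices - 1)


# The example from the comment: 25 guesses, 5 choices, 5 right and 20 wrong.
print(corrected_score(right=5, wrong=20, n_choices=5))
print(expected_corrected_from_guessing(n_guesses=25, n_choices=5))
```

The expected value is zero for any number of guesses and any number of choices, which is the sense in which the correction cancels out sheer guessing.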

  • Michael M.

    Question: Does converting from raw score to scale score automatically correct for guessing?

  • http://www.specialeducationmuckraker.com Dee Alpert

    Let me make things even worse.

    The NYCDOE’s procedures for test administration and collection of answer sheets allow those who wish to do so to fill in answer “bubbles” for questions kids can’t answer and haven’t guessed at. The test administrator manuals now say that teachers proctoring these exams should “make certain that students darken bubbles and completely erase stray marks.” http://schools.nyc.gov/NR/rdonlyres/FCB42B11-9499-40FE-927E-E8F032714928/64207/Memo18.pdf.

    There’s something wrong with having teachers look at kids’ test answer sheets before they’re graded – while the kids are still taking the tests. It’s easy enough for teachers to suggest that kids guess at questions they’ve left unanswered while the test is still being administered, or gently hint that a particular answer is wrong.

    Prior years’ versions of these manuals stated that teachers should – AFTER collecting completed answer sheets – eyeball them to remove extraneous pencil marks, which made it pathetically easy to fill in answers kids had left out entirely. These instructions were in effect for the tests whose scores we’re talking about now.

    In a prior audit, NYS Comptroller Carl McCall slammed this process, pointing out how easy it made it for adults to cheat. The Bd. of Ed. rejected McCall’s recommendation that teachers not be asked, or allowed, to do this, claiming that its “erasure” analyses were sufficient protection against adult cheating. There was never a response to McCall’s comment that erasure analysis wouldn’t catch instances where teachers just filled in answers kids hadn’t even tried to guess at. And this was before NCLB high stakes testing went into full force and effect! A recent Comptroller Thompson audit report noted that the NYCDOE has now eliminated even the inadequate erasure analyses it claimed back then as its main bulwark against adult test tampering. But the pressure to get scores up is far greater now.

    There are 2 different sets of issues here:

    1. The laughable “mandated” testing system State Ed. has set up, which is designed to protect as many district and school officials as possible from the opprobrium of being branded as running “flunking” schools and districts, while keeping the flow of federal funds to their coffers uninterrupted. Did you know that an outside firm regrades a sample of these tests? But … if inflated grading is identified, absolutely nothing is done about it. No scores are changed. No School or District Report Cards are changed. No bonuses are taken back when they’ve been awarded, in part or in whole, on reported scores. Nothing!

    2. The NYCDOE’s manipulation of that system and its downright falsifications regarding its students’ scores.

    Both are significant, but have different implications. When a politician declares that he is to be judged first and foremost according to the test scores of the students in the public education system he runs, then numbers are bullets, and are shaped and aimed very, very carefully for a political war. In a real battlefield, sometimes people can verify body counts to see if claims about numbers of enemies shot dead are accurate or fluff. I challenge anyone, and everyone, to get the real complete NYCDOE dataset for each State Ed.-mandated test including the reported grade enrollment for each NCLB subgroup AND the numbers, by grade and subgroup, who were reported as “absent” to see what the NYCDOE has reported, what it has withheld, and what it has manipulated and spun.

    Just as important is the flagrant failure of the authority – in this case the NYS Ed. Dept. – which set up a ridiculous testing system in the first place and then explicitly refused to do things to ensure that the tests it mandated were properly administered, properly scored and properly reported.

    State Ed. used to make public reports of very detailed analyses of scores on the mandated tests. These included comparisons of the scoring by NYCDOE and NYC private school teachers for the same test items. I say “used to” because the most recent reports State Ed. has uploaded to its web site carefully omit any mention of this information.

    Let me be blunt. For all we know, the exact same test answers which would give a NYC private school student a low Level 1 score may have given a NYCDOE student a high Level 2, or low Level 3 score.

    The devil’s in the details. When you “drill down” to the details, the NYCDOE’s test score reports and analyses are simply weapons in the upcoming Mayoral race. Nothing more: Nothing less. To give them any credibility whatsoever is simply unwarranted.

    Dee Alpert, Publisher
    SpecialEducationMuckraker.com

  • http://MoreThoughtful.blogspot.com ceolaf

    Michael M.,

    There’s no reason to think that it would.

    Imagine this scenario:

    * Two students.

    * Both know 10 answers, are 100% sure of them and they are right.

    * Neither kid has a clue about the other 17 answers.
    – Kid A leaves those 17 blank (i.e. 10 right, 17 blank, 0 wrong)
    – Kid B guesses randomly on all 17, getting 3 right (i.e. 13 right, 0 blank, 14 wrong)

    * In terms of raw scores, Kid A has a 10 and Kid B has a 13.

    What’s the next step? Do we convert those raw scores to scale scores, or do we first do something to correct for guessing?

    This is, essentially, what Ms. Senechal is taking advantage of. We really don’t want guessing to be rewarded in this context, unless the student has already eliminated some of the possible answers (because that would reveal that the kid knows *something*). The standard kind of correction accomplishes that, because rightly eliminating at least one answer increases the expected number of correct guesses.

    But with no correction for guessing, Kid B will have a substantially higher score than Kid A. In fact, this is a bigger problem at the low end than the high end, because kids at the high end are going to be able to do more questions properly. The fewer questions that kids have confidence in, the more they will be guessing and the bigger this issue becomes. The recent reports that the tests are getting harder (i.e. a lower raw score is needed to get a proficient scale score) mean that the benefit for guessing has gotten greater. In other words, inappropriate test prep now makes a bigger difference than before.

    Shocking, I know.
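The Kid A and Kid B scenario can be worked through with the standard correction (right minus wrong divided by one less than the number of choices). The sketch below assumes four answer choices per question, as on the tests in the article; that number, and the function names, are assumptions. It also shows the expected gain per guessed question after a student rules out some wrong options, which is the advantage the comment says the correction preserves.

```python
from fractions import Fraction

N_CHOICES = 4  # assumption: four options (A, B, C, D) per question


def corrected_score(right, wrong, n_choices=N_CHOICES):
    """Right answers minus wrong / (n_choices - 1); blanks are ignored."""
    return Fraction(right) - Fraction(wrong, n_choices - 1)


def expected_gain_per_guess(eliminated, n_choices=N_CHOICES):
    """Expected corrected points from one guess among the remaining options."""
    remaining = n_choices - eliminated
    p_right = Fraction(1, remaining)
    return p_right - (1 - p_right) / (n_choices - 1)


kid_a = corrected_score(right=10, wrong=0)   # left 17 questions blank
kid_b = corrected_score(right=13, wrong=14)  # guessed on all 17, 3 were right

print(f"Kid A corrected: {kid_a}, Kid B corrected: {float(kid_b):.2f}")
for eliminated in range(3):
    gain = expected_gain_per_guess(eliminated)
    print(f"{eliminated} option(s) eliminated: expected gain {gain} per guess")
```

Under this assumption Kid B ends slightly below Kid A, because 3 right out of 17 guesses is a little worse than the 4.25 that chance would predict. On average a blind guesser gains nothing, while a student who can rule out one or two options still comes out ahead.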

  • Michael M.

    ceolaf,

    And Kid C got 27 right, none blank, none wrong. And gets a 4. And Kid D got 26 right, none blank, and 1 wrong. And gets a 3.
    As does Kid E, who got 15 right this year, and got a 3, but who got 17 right last year and got a 2.

    And if every kid gets a minimum 10 right, isn’t that the scaled zero (or whatever number is the floor)?

    No prob with your math. But you’re reframing the question as a comparison of guessers to blankers. The issue is sorting out the guessers from the kids trying hard but getting it wrong anyway.

    Not shocked yet.

  • http://MoreThoughtful.blogspot.com ceolaf

    Michael M.

    3) The “shocking” comment was meant to be sarcastic.

    2) As for blanks vs. guesses, that does not have the substantive difference that you are putting on it.

    We know that the blanks are blank, and not guesses. They do not factor into this.

    Your question is how we want to treat guessing vs. effort, right? But that’s not the issue. These are not supposed to be tests of effort. They are supposed to be tests of skill, knowledge and proficiency, which I will lump into “skill” from now on. All that matters is how much skill they have, which we recognize in correct answers.

    * Kids who work out and get questions right for skill deserve every one of those raw score points. No question.

    * Kids who try to work a bunch of questions and get them all wrong should not get credit simply for their work. If their work consistently leads them astray, they do not have the skill being measured, and should not get points.

    * However, by working them out at least partially correctly, the kids will get some sort of approximation of an answer, allowing them to eliminate one or more possible choices. Thus, kids who have some skill, but not enough, will do more than randomly guess. They will make educated or partial guesses. Thus, they will get more than 1/n (n = number of possible answers) of them right. And the standard correction for guessing will still leave them with some extra points.

    * Kids who randomly guess should receive no benefit.

    So, where is the problem with what you call my “reframing”?

    (There are also issues with how false answers are written. For example, how plausible should they be? How much should they help us to identify common missed steps or the like? But those are other issues than merely scoring, in and of itself.)

    1) The floor. Actually, this year’s students do not set the floor at all. This is a common mistake in trying to understand assessment. Tests are linked across years, so floors are developed over a very long period.

    You are confusing norm-referenced tests with standards-referenced tests. If scores were reported as percentiles (e.g. Kid A is in the 31st percentile), you’d be closer to right. But those percentiles are not supposed to be based just on this year’s results, but rather on longer-term results, which is made possible by linking tests over time. But these are not norm-referenced tests. These are standards-referenced tests, and the cut scores were set arbitrarily (though hopefully not capriciously) by some committee or task force of some sort. Some sort of mapping from raw scores to scale scores was created so that score reports would look the same year after year, with the technically derived linking to old cut scores creating new raw–>scale conversions.

    It simply is *not* the case that the lowest score on this year’s test maps to a 0 on the scale scores.

  • Michael M.

    ceolaf,
    Even a broken clock is right twice a day. That doesn’t mean the clock was guessing, and has nothing to do with partial credit or getting an A for effort.

    Look, the whole reason the discussion turned to guessing was to show that guessing-plus-one is a Level 2, and is therefore meaningless at assessing the relative proficiency of a Level 1 or 2 kid. The level 2 kid may have just gotten luckier — on one measly question. And that’s more critical than the precise method by which the floor of the scaling gets set.

    But the Level 1 kid is now an election year political football to be held back because in the words of our pizza-loving Charter-Chancellor, that Level 1 student “didn’t ‘do the work.’”

    I got your sarcasm, so keep it coming. Not sure you get mine.

    P.S. Clock presumed analog.

  • http://MoreThoughtful.blogspot.com ceolaf

    Michael M.,

    I’m not sure what point you are trying to make with your clock analog-y, in this context. Clocks are not ambiguous, unlike student raw scores.

    The big point here is not about the floor, an issue that you raised. Rather, it is about misuse of tests. It is about whether the tests support the inferences being made. The difference of a single point is always an issue, and it is addressed with confidence intervals. That’s well established, and should not be a big deal.

    High-stakes decisions are being made based upon improper inferences. Perhaps something could be done to make the inferences stronger, but the failure to correct for guessing, a problem so well highlighted by Ms. Senechal, is just the latest demonstration that this regime has no interest in ensuring that these tests are valid and reliable.

  • http://www.thisweekineducation.com john thompson

    Brilliant post.

    I’d just add that this hypocritical social promotion through the back door produces the worst of both worlds.

    As Aaron’s recent post shows, you can have a debate on social promotion in general, and I never know how to decide on that question. But this dishonest version of just passing kids on is the most destructive way of doing it.

  • Michael M.

    Agreed.

    What I was trying to get at was, again, the red herring of blank answers. Agreed that the issue is guessing. I do follow your 10 vs 13. My point was there’s no way to tell if the kid who got 13 did so by guessing, or by being a bit smarter than the kid who got the 10. To me, the blanks just confused the picture.

  • Michael M.

    JT,

    To paraphrase PEP member Patrick Sullivan, what’s the point of having a student repeat a grade, with no additional program or intervention in mind, let alone funded, when the prior year in the about-to-be-repeated grade, the student scored effectively ZILCH (random only)?

    And Klein/Bloomberg’s “doing the work” rhetoric I find to be a heartless and stigmatizing blaming of the least able who need the most help.

  • Anonymous

    Perhaps I am misunderstanding the policy, but I don’t think that anybody is mandating the promotion of kids who score at level 2. As I understand it, promotion is generally a local decision, i.e. up to the principal. I believe that the new policy is only removing discretion from the principal in the case where a student has scored at level 1–which, as Ms. Senechal points out, requires a very low test score indeed. And according to the press release, students “who score at Level 1 are provided intensive remedial support, as well as the opportunity to attend summer school and to re-take the test.” That strikes me as a generally reasonable approach, provided that this remedial support isn’t strictly limited to level 1 students.

  • Michael M.

    But if simply retaking the test, even the next day, is likely to provide a Level 2 score, what are we really doing?

    Frankly, I’d prefer it if such decisions were up to the principals — in conjunction with the teachers, parents, and social workers (lest we forget the social impact of “flunking” a child) — and not left up to the big no-bid IBM selectric in the sky.

    As I understand it, we’re not talking voluntary repeats of Level 2’s; we’re talking mandatory hold-backs of Level 1.

  • http://MoreThoughtful.blogspot.com ceolaf

    Michael M,

    I know that this is a technical matter, but it is still a little bit important.

    Wrong answers vs. blanks *do* tell us something. Wrong answers tell us that the kid was trying to give an answer, though clearly not whether or not s/he was trying to do the work to get the right answer.

    The thing I think that you are missing is not the difference between a single blank and a single wrong answer, but rather the difference between multiple (i.e. > n -1) blanks or wrong answers. That many wrong answers tells us that the kid was giving answers to questions that s/he lacked the skill to answer correctly. And if there were n-1 answered incorrectly, simply by luck the kid probably answered 1 correctly. Whether it was through sheer guessing or dumb luck, that’s one question that the kid got right without having the skill to answer it.

    What ought we to do in that situation?

  • Michael M.

    ceolaf,
    You’re wearing me out.
    Does the way these tests are scored score blanks and wrong answers differently, regardless of number?
    The last word is yours…

  • http://www.thisweekineducation.com john thompson

    MM,

    That’s why I have no idea what I think about social promotion in general. But rarely do we face these issues in an abstract setting.

    Decisions regarding social promotion, remediations, interventions, enhancements, credit recovery or whatever need to be discussed honestly given the messy realities of what situation is being addressed.
    But sanctimonious fig leaves, which are the forte of Bloom/Klein, are destructive. I don’t think we disagree.

  • http://joannejacobs.com/2009/08/18/guess-pass-2/ Guess pass at Joanne Jacobs

    [...] New York students can score in Level 2 — good enough for promotion to the next grade — by guessing on the end-of-year exam, claims Diane Ravitch. [...]

  • http://MoreThoughtful.blogspot.com ceolaf

    Ms. Alpert,

    I agree with most of what you have written, until the end. While Bloomberg has given this even more political significance than it usually has, its significance is *not* limited to politics.

    There *are* political implications, and these things do impact the education of children and educational reform, even apart from the political avenue. This testing regime goes back further than Mayor Mike’s tenure as mayor. High quality testing does have educational value, and weaknesses in our assessment programs have *at least* an opportunity cost for students and educators.

  • http://www.specialeducationmuckraker.com Dee Alpert

    You write that “High quality testing does have educational value, and weaknesses in our assessment programs have *at least* an opportunity cost for students and educators.”

    How true. But I’ve looked – hard – since NCLB’s inception and so far, I’ve failed to find one state which has a system of what you’d call “high quality testing.” I’m certainly not implying that this can’t be done – I’m simply noting that our current American education industry and infrastructure are either unwilling or unable to produce one and then let the education industry in the states live with the results of those high quality tests. And as with most things in the education industry, NY is both the most expensive and the most corrupted.

    So given the hand that we’ve been dealt, for me the issue is whether any of the data purportedly euchred out of the testing system is valid and reliable for any purpose whatsoever.

    Unfortunately, I think not.

    Now, for the other side … I always have fun telling folks about the USDOE-sponsored, longitudinal, national, large-scale research done re kids with disabilities at the elementary and secondary levels. These show, without exception, that there is “almost zero” correlation between the subjective grades teachers give kids with disabilities and those kids’ objectively-assessed reading and math levels. I’m talking about diagnostic testing for real reading and math levels, not the NCLB-mandated tests.

    So for the 15% of kids who are classified as having some disability or another, if we don’t have a legitimate testing system and we don’t have legitimately reported results … we wind up with what we have now: massive numbers of kids with mild disabilities who’ve been graduated from high school with allegedly “regular” high school diplomas and who can’t read or do math at anything close to the high school level. I suspect that these are many of the NYCDOE’s kids who have been given the benefit of bogus “credit recovery” in order to get their diplomas.

    For me, these are the kids who are the real victims of the ridiculous tests NYS provides and the even more ridiculous scores our State and NYC spinmeisters have massaged so well.

    Dee Alpert, Publisher
    SpecialEducationMuckraker.com

  • Michael M.

    Dee,

    Thank you for your comments and your perspective.

    My younger brother, now in his 40s, has CP and significant physical handicaps, the term back then. If it were up to standardized testing (without accommodations), he’d never have made it to grad school. People, especially those facing severe challenges, are themselves not standardized.

    I have seen the disaggregated data for ELA and Math for D2 elementary and middle school students. The scores for Special Ed kids are lower than those of many other sub-groups. Between the implication of the scores, and quite possibly the testing structure, we’re doing children with disabilities a disservice twice over.

  • Marty

    This post helps me understand why I, a social studies teacher, found ARIS not to be of much help after my principal made a big deal of requiring us to log on and check our students’ scores (in my case I was checking the 8th grade scores of my 9th graders). Quite frankly, I can assess the kids better myself despite having 34 in each class.

  • http://transparentchristina.wordpress.com/2009/08/26/when-social-promotion-essentially-becomes-necessary-to-run-the-schools/ When social promotion essentially becomes necessary to run the schools….. « Transparent Christina

    [...] Posted by John Young on August 26, 2009 bad things can happen….what a mess this must be. [...]

  • Nuisance

    At least in North America, jobs that require meaningful levels of education have been in decline for decades. Even in the financial industry very little is required, as we have all learned recently. So, I really don’t see what the problem is. All we need is a system that warehouses the youth for a time so that unemployment doesn’t spike to levels high enough to result in violent revolution.

  • http://www.thelatestliberalblogs.com/liberalblogs/2009/09/28/dan-brown-grading-the-big-tests-a-study-in-madness-and-a-really-good-new-book/ Dan Brown: Grading the Big Tests: A Study in Madness… and a Really Good New Book | The Latest Liberal Blogs

    [...] corporations have vested interests in getting favorable stats. Last month, Diana Senechal at gothamschools.org proved it was possible to guess randomly on New York State exams and [...]

  • http://blog.coreknowledge.org/2009/09/30/moving-the-chains/ Moving the Chains at The Core Knowledge Blog

    [...] Diana Senechal recently described an experiment in which she was able to “pass” several standardized tests just by guessing and without even [...]

  • http://blog.coreknowledge.org/2009/11/16/ed-blogger-named-to-common-standards-panel/ Ed Blogger Named to Common Standards Panel at The Core Knowledge Blog

    [...] who until recently taught at a Core Knowledge school in New York City, made waves recently when she showed that it was possible to pass New York State ELA and Math tests by simply [...]

  • Kenny

    This analysis is completely flawed. You cannot say anything about the ability of a student to pass “randomly” by completing two tests. Simply put, both the answer choices and the order are randomly assigned. A randomly completed test will earn, on average, Q/N raw points, where N is the number of answer choices and Q is the number of questions. Here that is a 25% grade on the multiple-choice section. However, there are N^Q ways of completing the test, and for a significant number of them, the grade will be passing. Increasing N makes the test harder to pass, while increasing Q increases the number of chances to measure a student’s aptitude.

    On a well-designed test, more questions mean that the questions will range in difficulty, and thus increase the chances of differentiating each level of skill, but note that it will NOT change the random probability of passing the test. Specifically, a significant number of easy questions help lower the chances that a student has literally not read the test and is randomly choosing answers.

    All you have shown is that you have found two out of the N^Q ways of completing this test that result in a pass. There are a number of ways to change these factors but absolutely no way in hell that you can control for a “random answer” strategy of passing, any more than you can control for a randomly chosen lottery number being a winning ticket. There will always be at least one, and in a good test, many more. Tests are evidence of a student’s ability, not a final judgement.

    Any of these issues would be covered in the most basic courses a teacher should take in statistics. Whether teachers are able to pass their statistics classes in university by adopting a random choice strategy without understanding statistics, I cannot say, but I believe that any of them would be sufficiently competent to understand the above.
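
    The count of N^Q answer patterns can be tied directly to a probability. What follows is a minimal Python sketch, assuming every pattern is equally likely under blind guessing and each question has exactly one correct choice. The function name and the example figures (26 questions, 4 choices, 7 correct needed, as cited for the sixth-grade ELA test elsewhere in this thread) are for illustration only.

```python
from math import comb

def passing_patterns(q, n, cut):
    """Count answer sheets with at least `cut` correct out of `q` questions,
    when each question has `n` choices and exactly one is correct."""
    # Pick which k questions are right; every other question
    # can be wrong in (n - 1) different ways.
    return sum(comb(q, k) * (n - 1) ** (q - k) for k in range(cut, q + 1))

total = 4 ** 26                       # all possible answer sheets
passing = passing_patterns(26, 4, 7)  # sheets that reach the assumed cut
print(f"{passing} of {total} patterns pass ({passing / total:.1%})")
```

    Under these assumptions, the share of passing patterns equals the chance that a single blind guesser passes, so counting patterns and computing a binomial probability are two views of the same quantity.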

  • Diana Senechal

    Kenny,

    The experiment was informal, but in fact the odds of getting a 2 by guessing on these tests are quite high. I did some follow-up, and these were some of my findings.

    For each test, I calculated the mean by dividing the total number of multiple-choice questions by 4 (since there are 4 options for each question). I calculated the standard deviation by taking the square root of (total number of questions x 0.25 x 0.75). I used a binomial distribution probability calculator from there. Consulting the score conversion tables (raw to scale and scale to performance level), I arrived at the following:

    1. On the third-grade math test, there are 25 multiple-choice questions. A student need only answer 11 correctly to get a 2 (without doing anything whatsoever on the written portion of the test). There is approximately a 3 percent chance of doing this through random guessing.

    2. On the fifth-grade ELA test, there are 24 multiple-choice questions. A student need only answer 8 correctly to get a 2 (without doing anything whatsoever on the written portion of the test). There is approximately a 23 percent chance of doing this through random guessing.

    3. On the sixth-grade ELA test, there are 26 multiple-choice questions. A student need only answer 7 correctly to get a 2 (without doing anything whatsoever on the written portion of the test). There is approximately a 48 percent chance of doing this through random guessing.

    4. On the seventh-grade ELA test, there are 30 multiple-choice questions. A student need only answer 9 correctly to get a 2 (without doing anything whatsoever on the written portion of the test). There is approximately a 33 percent chance of doing this through random guessing.

    5. On the seventh-grade math test, there are 30 multiple-choice questions. A student need only answer 11 of them correctly to get a 2 (without doing anything whatsoever on the written portion of the test). There is approximately an 11 percent chance of doing this through random guessing.
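
    The calculation described above can be reproduced without an online calculator. This is a hedged sketch, not the calculator actually used: it assumes each of the 4 options is equally likely to be picked, takes the question counts and required correct answers from the list above, and uses a function name invented for illustration.

```python
from math import comb

def prob_at_least(n, k, p=0.25):
    """Exact binomial tail: P(at least k correct out of n blind guesses)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# (test, multiple-choice questions, correct answers needed), from the list above
tests = [
    ("Grade 3 math", 25, 11),
    ("Grade 5 ELA", 24, 8),
    ("Grade 6 ELA", 26, 7),
    ("Grade 7 ELA", 30, 9),
    ("Grade 7 math", 30, 11),
]

for name, n, k in tests:
    mean = n * 0.25                  # expected number of correct guesses
    sd = (n * 0.25 * 0.75) ** 0.5    # standard deviation of that count
    print(f"{name}: mean {mean:.2f}, sd {sd:.2f}, "
          f"P(>= {k} correct) = {prob_at_least(n, k):.3f}")
```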

    Now, this assumes that the student only guesses on the multiple-choice questions and does nothing on the written parts. But what if the student knows a few of the answers—or earns a few points on the written section? Here are some scenarios.

    1. Let’s take the third-grade math test. Let’s say that in addition to guessing on the multiple-choice part, the student earns two points on the written portions. His or her chances of getting a 2 go up to about 15 percent.

    2. On the fifth-grade ELA test, if the student earns two points on the written portions of the test and guesses the rest, his or her chances of getting a 2 go up to about 58 percent.

    3. On the sixth-grade ELA test, if a student earns 2 points on the written portions and guesses the rest, his or her chances of getting a 2 go up to about 82 percent.

    4. On the seventh-grade ELA test, if a student earns 2 points on the written portions and guesses the rest, his or her chances of getting a 2 go up to about 65 percent.

    5. On the seventh-grade math test, if a student earns 2 points on the written portions and guesses the rest, his or her chances of getting a 2 go up to about 33 percent.
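
    The same arithmetic extends to these scenarios. The sketch below assumes that each multiple-choice answer is worth one raw point and that written points add directly to the raw score, so two written points simply lower the number of correct guesses still needed by two. The function name is hypothetical, and the cut-offs are the ones quoted earlier in this comment.

```python
from math import comb

def prob_level2(n_mc, cut, written_points=0, p=0.25):
    """P(raw score >= cut) when `written_points` are earned on the written
    section and every multiple-choice answer is a blind guess."""
    needed = max(cut - written_points, 0)  # correct guesses still required
    return sum(comb(n_mc, i) * p**i * (1 - p)**(n_mc - i)
               for i in range(needed, n_mc + 1))

# (test, multiple-choice questions, raw points assumed to be needed for a 2)
scenarios = [
    ("Grade 3 math", 25, 11),
    ("Grade 5 ELA", 24, 8),
    ("Grade 6 ELA", 26, 7),
    ("Grade 7 ELA", 30, 9),
    ("Grade 7 math", 30, 11),
]

for name, n_mc, cut in scenarios:
    chance = prob_level2(n_mc, cut, written_points=2)
    print(f"{name}: {chance:.0%} chance of a 2 with two written points")
```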

    And now compare the odds with the numbers and proportions of students at level 1. Is it a coincidence that, in 2009, 0.1 percent (yes, one-tenth of one percent) of sixth-grade students in New York State scored at level 1 in ELA? In New York City, a total of 146 sixth graders scored at level 1 in ELA.

  • http://MoreThoughtful.blogspot.com ceolaf wolfhelm

    Kenny,

    I think that you fundamentally misunderstand the argument.

    We can predict how many questions students will get right with random guessing. Sure, there is a variance to the actual result, but random guessing will generally result in students getting about 1/n of the questions correct (where n is the number of answers to choose from).

    The issue here is NOT that the questions have been made too easy. The issue is that the cut score has been dropped too low. Imagine needing just 1/(2n) of the questions correct to pass. In that case, we would expect almost all the kids to pass, especially if they just guessed for each question. Well, in this case, the cut score is not nearly far enough above 1/n (if one takes into account easy points on the written response section) to make the entire test credible. Add in some easy questions (and there SHOULD be some easy questions, right? I mean, that would be good test design) and the test overall simply becomes too easy to pass.

    Yes, Ms. Senechal’s dramatic and true examples make for a great story while perhaps backgrounding the fundamental problem. But that doesn’t mean that she is, at root, wrong.

    There are well-known techniques for dealing with guessing. Yes, they make statistical analysis more difficult, and scoring slightly more difficult, but many are well established. I would argue that a practical late step to take in test design would be to make sure that the most common guessing patterns do not result in passing scores.

    Moreover, true random assignment of answer orders and question orders is actually a bad idea, because it can result in bad outcomes. Randomization is a weak answer. We already know that if you have too many of the most difficult items early, you can depress scores on the test. So, if randomization is one of your strategies, you cannot stop there. Once you have randomized, you’ve GOT to check for known issues, and then re-randomize if the former roll of the dice fell into any of them.

  • Kenny

    Ceolaf,

    You are confusing a number of test design parameters. The order and structure of questions is not typically random for a number of reasons. The order of answers is almost always entirely random, even when such answers are themselves ordered (i.e. in ascending numerical order). You also are confusing the psychological aspects of test design (i.e. having easy answers) with the statistical aspects (i.e. random choices should be easily discernible from non-random). No matter whose test design principles the State supports, the test can be effective from a statistical standpoint.

    There are indeed well-known techniques to neutralize guesses, but most of them do not revolve around attempting to minimize the score of any particular pattern, because there are too many patterns. For some reason, tests are generally designed to have a roughly equal number of correct answers assigned to each answer choice which does minimize the effect of “all C’s”.

    As for your point that “the cut score has been dropped too low”, I agree. I would note that rather than blaming the mechanism, you must simply state the problem: “students who guess every single answer stand a very high chance of being promoted, a small amount of effort on their part increases those chances to near-certainty”. In Diana’s post in the comments, I think we have much more well-supported evidence that this is true in several instances and I feel much more confident relying on that versus data from only two trials.

  • http://www.alanlawrencesitomer.com/2010/03/23/the-absolute-folly-of-bubble-tests-widely-exposed/ The absolute folly of bubble tests WIDELY exposed!! » Alan Lawrence Sitomer

    [...] read this. I have never seen the folly of the bubble tests exposed in a more lucid, “I can’t [...]

  • http://www.arsdocendi.org/2010/04/16/legislating-performance/ Legislating Performance « Arsdocendi.org

    [...] To put this in perspective, you must try and imagine an interview by CNN of the principal of a school in California or Florida, in which the principal thanks his home state’s politicians for the trust they repose in him and his teachers. You must also imagine breakfast tables across California with school children eating while having a friendly give-and-take with Mom and Dad before heading off to school. And you must imagine young Californians and Floridians beaming over their scores on graduation tests that have not been cooked so that students can pass them with random answers, as can happen with tests for promotion in New York. [...]

  • http://www.arsdocendi.org/2010/08/28/a-problem-of-fine-distinction/ A Problem of Fine Distinction « Ars Docendi

    [...] It is also possible that such places are loci of pedagogical or intellectual scandal, as in New York, some of whose “proficiency” tests for promotion could be passed by random guesswork. But let us assume competence, good will, and normal distribution of aptitude for the sake of this [...]

  • http://www.arsdocendi.org/2011/03/06/390/ Ars Docendi

    [...] such as the “realignments” of SAT scoring or the setting of “proficiency” exams by which random guesswork can yield a promotion from one grade to the next. Even now, some testing retains a pre-Potemkin integrity. Hence the PISA test scores or the scores [...]
