Measure For Measure

If any school should be a poster child for President Bush’s signature education reform, the No Child Left Behind Act (NCLB), it’s Herbert Marcus Elementary School. The inner-city Dallas, Texas, school is exactly the kind that the law was designed to improve. It’s a shabby, overcrowded place on the edge of a grim industrial zone, where lunch starts at 10 a.m. to handle the school’s 1,140 students, nearly all of whom live in poverty, two-thirds of whom don’t speak English as a first language, and whose parents have on average a 7th-grade education.

But the school has done just about everything right in recent years. Principal Conce Rodriguez has introduced reforms that require students to wear uniforms and teachers to submit weekly progress reports on every student in every subject. There’s an expanded pre-kindergarten program, teacher-attendance incentives, and a big tutoring project. Rodriguez even hired a “community liaison,” who has expanded the school’s PTA membership to 700, the largest in Dallas. On a typical day, there are 50 parents volunteering at the school. Marcus’ 97 percent student attendance rate is one of the highest in the city.

The changes at Marcus have paid dividends in the classroom. Under NCLB, the Dallas schools must test kids in reading and math every year between 3rd and 8th grade, and under Dallas’ system of rating those test results, Marcus placed 19th out of the city’s 206 schools, a significant accomplishment for a school with such difficult demographics. Marcus produced six more months of learning in a single school year than some other Dallas schools with equally disadvantaged students.

But that’s Dallas’ system of rating schools. The state of Texas, as required by NCLB, reads the scores differently. Under that federally-mandated rating, Marcus fares much worse–it gets a middling “acceptable” score, which places it only 76th in the city. That’s one step away from being labeled failing under NCLB.

The discrepancy isn’t the product of fuzzy math. It’s the result of competing ideas about how to use test results. NCLB’s strategy is to cast a bright light on student performance and impose reforms on schools that don’t measure up. To do this, it mandates that student populations be judged once a year against a fixed state standard, while Dallas measures changes in individual student performance from one year to the next.

Other Dallas schools are finding themselves similarly penalized by NCLB. Schools ranked 2nd, 5th, 8th, and 16th under the city’s rating system were ranked 94th, 77th, 83rd, and 107th in the Texas NCLB regime. At the same time, the school that placed 3rd under the state system was ranked 25th under Dallas’.

The sense that Texas’ NCLB ratings are unfair has demoralized the staff at schools like Marcus. Several of Rodriguez’s veteran teachers have left in frustration for wealthy suburban school districts where they felt their achievements would be more fairly recognized, staff members at Marcus say. Keeping good teachers is always a struggle for inner-city schools, and in many cases, NCLB has made things worse.

Since shortly after its passage, the law has been under heavy attack–from congressional Democrats who say the administration hasn’t invested the money it promised to implement the law, from Republican state legislators who resent the federal intrusion, and from teacher union leaders who never liked test-based accountability in the first place. But there have been other, less widely publicized, indictments of NCLB from parents, teachers, and principals who support high-stakes testing but see how NCLB is playing out in the classroom.

Educators in impoverished neighborhoods argue that the law’s criteria for school success don’t sufficiently take into account how immensely far behind many of their students are when they start school. Educators and parents in affluent communities, where students routinely score above state standards, have a different complaint: that the NCLB accountability system is leading to a dumbing down of their schools’ curricula. Many testing experts, meanwhile, point out that NCLB creates a host of perverse incentives, including encouraging states to set their academic standards low to reduce the number of their schools labeled “failing” under the law–the opposite of what NCLB’s authors intended.

Education Secretary Margaret Spellings has responded to these charges with modest retreats, loosening some of the criteria by which schools are labeled failing so that fewer of them will be. These steps may buy the administration some short-term political relief. But they don’t address the law’s core problem: NCLB doesn’t accurately measure the extent to which schools are improving student achievement.

This is no small flaw. The failure of public schools to educate America’s most disadvantaged students is the country’s most glaring and abiding social and moral problem. Over nearly two decades, a rough national consensus has developed to improve schools by holding them accountable for their students’ performance via high-stakes tests. But the belief that the test ratings are fair and accurate is the linchpin of this whole system. And that belief is weakening. Even the Bush aide who helped design NCLB has his doubts. That’s why it’s worth looking at the Dallas school system, not just as an example of what’s wrong with NCLB, but as a model of how to fix it.

Why does the Dallas school rating system do a better job of identifying high-performing schools than Texas’ NCLB-driven system? The answer lies in the different ways the two systems define student performance. Under NCLB, Texas measures school performance based on a system called “adequate yearly progress,” which calculates the percentage of various student populations that annually meet or exceed the state’s academic standards; in other words, NCLB measures the progress of student groups towards a universal fixed point.

Dallas, by contrast, measures individual student progress from a relative starting point. It compares a pupil’s current test scores with the same pupil’s scores a year earlier. A school in Dallas wins a high rating if its students on average score higher than would have been predicted based on those same students’ prior level of achievement and if the school’s performance overall is better than that of other schools with the same demographics. It earns a low rating if its students perform worse than would have been predicted.
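The contrast between the two rating methods can be sketched in a few lines of Python. This is an illustrative simplification, not the actual Texas or Dallas formula: the passing score of 70, the expected gain of 8 points, the two schools' scores, and the function names are all hypothetical. The predicted score here is simply last year's score plus a typical gain, whereas the real Dallas model also compares a school with others that have the same demographics.

```python
# Hypothetical data: each pupil is a (prior_year_score, current_year_score) pair.
STANDARD = 70       # assumed fixed state passing score
EXPECTED_GAIN = 8   # assumed typical one-year gain used to predict a score


def pass_rate(pairs, standard=STANDARD):
    """Fixed-standard rating: share of pupils at or above the standard."""
    return sum(cur >= standard for _, cur in pairs) / len(pairs)


def value_added(pairs, expected_gain=EXPECTED_GAIN):
    """Growth rating: average of (actual score - predicted score),
    where predicted score = the same pupil's prior score + expected gain."""
    return sum(cur - (prior + expected_gain) for prior, cur in pairs) / len(pairs)


# A high-poverty school whose pupils start far behind but gain a lot.
school_a = [(40, 55), (50, 66), (62, 74), (45, 58)]
# An affluent school whose pupils start ahead but gain little.
school_b = [(80, 84), (75, 80), (90, 93), (72, 78)]

for name, school in (("A", school_a), ("B", school_b)):
    print(name, pass_rate(school), value_added(school))
# A 0.25 6.0
# B 1.0 -3.5
```

In this made-up example, the high-poverty school gets only a quarter of its pupils over the bar yet beats its predicted scores by six points on average, while the affluent school passes everyone but falls short of its predictions.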

Unlike NCLB, the effect of Dallas’ approach is to factor out the disadvantages (or advantages) that students bring to school because of the quality of prior instruction or their family backgrounds. Instead, it measures only the amount of knowledge that the school itself–or indeed, an individual teacher–is responsible for imparting. It calculates, in other words, the amount of “value added” by each school.

By doing so, the “value added” school-rating metric provides a more accurate picture of which schools are actually educating their students well. It is also fairer to schools and teachers working with the most disadvantaged kids. It pressures them to perform without penalizing them for taking on the hardest assignments in education. Conversely, the system doesn’t reward rich schools with privileged students merely for standing still. Passing the state test, an easy task for many of their students, is not good enough.

This greater accuracy and fairness in turn provides a host of other advantages. It offers districts and states a firmer basis upon which to mete out rewards to good schools and sanctions to bad ones. It gives teachers clearer pictures of how individual students are progressing, making it a more useful diagnostic tool. It gives principals a system of measuring teacher ability–a system that even most teachers who work under it think is fair. It potentially offers parents much more accurate and fine-grained information than they can currently get on how individual schools and teachers are performing, thus making the job of choosing the right school easier. And it avoids some of the perverse incentives that NCLB has created to set standards low or otherwise game the system.

Experience has borne out these benefits. In places such as Dallas where a value-added metric has been tried, district officials say, it has proven superior in pointing out the strengths and weaknesses of different teachers. When Dallas researchers looked at the city’s schools with the highest ratings under its value-added system, they found that those schools indeed had all the attributes associated with good schools. They were orderly and safe. Teachers knew their subjects and their students well. Students who fell behind were able to get help outside of class. And principals were able to recruit top teachers and, perhaps most important, push out bad ones.

Robert Mendro, the district’s research director, says that the Dallas schools such as Marcus that have used the student and teacher information that value-added testing gives them to target reforms have outperformed their peers. His only regret is that more Dallas schools haven’t done so. Indeed, district-wide test scores have fluctuated in Dallas since the introduction of value-added testing in the early 1990s–a function, he says, of the fact that the Dallas accountability system offers only modest rewards for high-performing schools and lacks tough consequences for schools that fail to improve.

Tennessee has also put the value-added metric to good use. In 1992, education officials began giving every school a value-added rating and they taught educators how to use the value-added analyses to target schools and teachers for improvement. Subsequently, Tennessee test scores trended up during the 1990s. Researchers at the University of Tennessee found that average math scores rose in 40 percent of the state’s school districts and declined in only 8 percent between 1991 and 1999. Half as many districts improved in reading, but an even higher percentage moved up in language arts and science. What’s more, the study found that the improving districts improved a lot. Their students opened a 1.7-year gap in math proficiency between the 3rd and 8th grades, and a 2-year gap in reading. “You could definitely see the systems that were using the value-added information and those that weren’t,” state testing director Mary Reel told me. The lesson: No school rating system will raise test scores by itself. But when a value-added system is tied to in-school follow-through, the results can be impressive.

Recently, I spoke with Sandy Kress, a veteran school reformer now with the law firm of Akin Gump. Dallas’ value-added school rating system, he told me, is a “sophisticated tool” and “more appropriate” than the NCLB rating system. And Kress should know. He helped design both systems.

A Treasury Department aide in the Carter administration and later chairman of the Dallas Democratic Party, Kress was chosen in 1990 to lead a school reform commission for the district. Looking for a way to motivate the city’s schools, Kress learned that the district’s research staff had tinkered with a school incentive system in the mid-1980s that had used simple value-added ratings. The staff hadn’t had the resources to resolve the technical challenges they faced, and they shelved the plan. Kress dusted it off, worked with statisticians on his task force (one member was an epidemiologist), the district’s staff, and outside researchers to refine it, and made it the cornerstone of the commission’s recommendations.

The Dallas school board embraced Kress’s reform blueprint, and in 1992, the value-added testing model became the basis for a citywide accountability system that gave substantial financial rewards to top-rated schools and targeted low performers for reform.

Kress’s work on the commission, and in particular his innovative accountability plan, landed him a seat on the Dallas school board in 1992. It also attracted the attention of Texas Lt. Gov. Bob Bullock, a friend of Kress’s since law school. Bullock named Kress the head of the accountability sub-committee of a statewide education reform commission that he was leading for Democratic Gov. Ann Richards. By 1993, Kress had created for the state’s Democratic leadership the Texas school accountability system that eventually helped George W. Bush win the White House.

Bush had sought out Kress when he captured the Texas governorship in 1994 and turned to him again in crafting the no-child-left-behind theme that played so well during the 2000 presidential campaign. Bush brought Kress, by then a friend and ally, to Washington with him as a special assistant, and the former Carter official played a key role in shaping and winning congressional passage of NCLB, the signature domestic policy accomplishment of the president’s first term.

But as Kress’s reform efforts progressed from Dallas to Austin to Washington, the value-added concept fell away. Texas policymakers (and later officials with the Bush presidential campaign and the White House) were looking for a practical concept that liberals and conservatives could agree on; arguing that every student should be held to a single standard made sense to most voters. Judging students not against some objective standard but against themselves, they worried, would merely create a double standard–one for high achievers, one for low. Moreover, the value-added method was a relatively new idea, and few states had the computers or skilled statisticians to administer it.

I first discussed NCLB with Kress in late 2001 in his large corner office in the White House’s Eisenhower Executive Office Building, just as Congress was about to pass the law. He was straightforward in suggesting that value-added ratings were “a whole lot fairer” to schools than the NCLB rating system would be. The NCLB system would favor wealthy schools, he acknowledged.

Members of Congress on both sides of the aisle were just as aware as Kress was of NCLB’s shortcomings–and of the importance of seizing the moment. George Miller, a California liberal and the ranking Democrat on the House Education and Workforce Committee, in particular, saw NCLB as an opportunity to pressure schools to help poor and minority students. Miller and his allies were willing to use a far-from-ideal system of rating schools because they had a chance to focus national attention on traditionally underserved students. “If we waited for statisticians and psychometricians to get [school evaluation] right [in every state] we’d never get anything done [for disadvantaged kids],” said Miller’s education aide at the time.

So, eager to make good on his campaign claim that he was a “compassionate conservative,” and with even congressional liberals backing his plan, the president introduced NCLB on his third day in office. Eleven months later the bill was back on his desk for signing.

NCLB’s accountability system was, Kress and the others believed, the best that could be achieved in 2001, and they were probably right. But the flaws in its testing mandate have become more and more evident with each passing year. In particular, education insiders say that NCLB’s rating system holds schools responsible for factors they can’t control. Studies have shown that such things as English proficiency, family income, and parental education have a major influence on test scores. But NCLB fails to account for that reality, because it judges schools on whether enough students pass a set of tests once a year rather than, as the value-added model does, on how much schools increase students’ learning over the course of a year. Respected measurement expert and accountability advocate Stephen Raudenbush of the University of Michigan has declared NCLB “scientifically indefensible.”

The flaws in the law virtually invite schools to game the system in various unproductive ways. To increase their scores, critics say, many schools lavish disproportionate attention on so-called “bubble” students–those who score just below the state standard. The rationale is clear, if cynical: The easiest way to avoid being labeled failing is to push these marginal students over the line. The more difficult students often get less help as a result. And while timetables built into the law will eventually force schools to deal with those students, there is, perversely, no incentive in the law for schools to focus on their more capable students. Indeed, research by value-added advocate William Sanders has shown that the rate of progress for high-achieving students in low-performing schools in Tennessee has actually declined since the implementation of NCLB.

The value-added methodology, by contrast, doesn’t create such incentives to focus on a handful of students. Under the system, every child’s improvement counts the same towards the school’s overall rating. And the methodology itself is widely seen by those who use it as fairer and more accurate. Value added should thus make it easier for teachers to accept the idea of higher pay for outstanding performance and for working in the toughest schools–changes many see as important next steps in reforming education. Indeed, Dallas is already doing this. Teachers at schools with high value-added scores get financial bonuses.
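The difference in incentives can be illustrated with a small Python sketch. The scores, the passing mark of 70, and the function names are hypothetical, and the growth measure is reduced to a plain average gain rather than any real district's formula.

```python
STANDARD = 70  # assumed fixed passing score (hypothetical)


def pass_rate(scores, standard=STANDARD):
    """Share of pupils at or above the fixed standard."""
    return sum(s >= standard for s in scores) / len(scores)


def mean_gain(before, after):
    """Average score improvement per pupil; every pupil counts the same."""
    return sum(a - b for b, a in zip(before, after)) / len(before)


before = [40, 45, 68, 69, 85]

# Strategy 1: extra help goes only to the two "bubble" pupils (68 and 69).
bubble_focus = [40, 45, 71, 72, 85]
# Strategy 2: the same six points of improvement go to the two lowest scorers.
low_focus = [43, 48, 68, 69, 85]

print(pass_rate(before), pass_rate(bubble_focus), pass_rate(low_focus))
# 0.2 0.6 0.2
print(mean_gain(before, bubble_focus), mean_gain(before, low_focus))
# 1.2 1.2
```

Under the fixed standard, only the first strategy moves the school's rating. Under the growth measure, both strategies earn the same credit, so there is no special payoff for concentrating on pupils near the line.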

Many of the obstacles to the widespread use of value-added ratings have been overcome in recent years. Thanks in part to the passage of No Child Left Behind, schools all over the country are on their way to testing every year from 3rd to 8th grade–a prerequisite for the value-added methodology. States are also beefing up their computer and statistical resources. Researchers are still working to address some of the toughest technical issues raised by the value-added method, such as how to measure students who move from school to school and how to compare scores on a subject year-to-year when the curriculum changes. But enough progress has been made that more and more states are looking at the value-added idea.

Ohio and Pennsylvania have launched their own value-added systems, and New York City, the nation’s largest school district, plans to do the same. In the United Kingdom, the Blair government is instituting a national value-added school-rating program. And even U.S. Secretary of Education Spellings has arranged a panel to consider assessment alternatives that might include value added. If Spellings doesn’t wind up advocating the value-added metric for NCLB, Kress hopes Congress will. “[Value added] ought to be one of the central improvements made to NCLB when it’s reauthorized in two years,” he told me.

There’s an argument for replacing the adequate yearly progress method mandated by NCLB with value-added. But the political obstacles to doing so would be considerable. The idea that there should be one standard for all students, regardless of race or income, and that all schools should be held responsible for meeting those standards, is the gravity that holds the liberal and conservative sides of the school reform movement together. Moreover, setting that single standard for all students does seem to have the effect of lifting the aspirations of parents, students, and teachers in many low-income schools, and sparking a sense of panic that is not unhelpful given the dismal performance of many of these schools. Dropping the standards approach entirely makes no sense politically or policy-wise.

One solution might be to publish scores from both the standards-based and value-added methods but to tie rewards and sanctions only to the latter. Another would be to combine the two ratings strategies. That’s what Dallas has done in recent years, what Tennessee wants to do, and what value-added advocates like Sandy Kress support.

From the outside, Marcus Elementary doesn’t look like much. The front yard is a derelict landscape of dirt and crab grass. Torn shades hang awry in the school’s aging window frames. But its library is bustling, its classrooms are lively, and its hallways are filled with students’ work. It’s not what you’d imagine a school filled with desperately poor, non-English-speaking students would be like. But seeing it up close, you can begin to imagine a better future for American education.

Marcus’ Principal Rodriguez, a short, cheerful man in his mid-30s, told me, when we met in his spare, painted-cinderblock office, that his emotions were split between excitement at the possibilities offered by value added and frustration at the problems he has to deal with because of No Child Left Behind. He was pleased that the Dallas rating system has recognized Marcus for the strong school that it is–even as he chafed at the fact that under NCLB he’s expected to produce the same results as his counterparts in rich school systems like nearby Highland Park. “One system’s fair,” he told me. “One isn’t.”

Thomas Toch