Weighing and weighting the evidence: the Measures of Effective Teaching project

The politics of teaching policy continues to produce contentiousness among reformers and researchers as well as administrators, union leaders, and practicing teachers themselves. In response, the Bill & Melinda Gates Foundation invested three years and $45 million in painstaking efforts to find the holy grail of teacher evaluation.

Last week, with rightful fanfare, the Bill & Melinda Gates Foundation released its findings from the Measures of Effective Teaching (MET) Project [PDF]. In short, lead researchers Tom Kane and Steve Cantrell found that a balanced approach to assessing teachers is more reliable and valid than one based primarily on standardized tests and value-added methods. Researchers also found that multiple observations from different raters, including peers, improved the accuracy of the evaluations. The study involved about 3,000 teachers in seven districts: Charlotte-Mecklenburg, N.C.; Dallas; Denver; Hillsborough County, Fla.; Memphis, Tenn.; New York City; and Pittsburgh.

While the research was under way, the public conversation about teaching evaluation continued. Today, many of the largest and most sophisticated districts and unions take for granted that no one measure can tell the whole story about a teacher’s practice. Even so, the MET Project produced several reports that might be helpful to states and districts as they consider how to balance various measures of teaching effectiveness, including classroom observations, surveys of students’ perceptions of their teachers, and value-added methods that attempt to isolate teachers’ contributions to their students’ achievement gains.

In one of the more interesting conclusions, Kane and Cantrell suggest that states and districts assign weights to the various measures of teaching effectiveness. No one measure should greatly outweigh the others; otherwise, the result is less accurate and reliable. This finding meshes well with what other assessment experts like Jim Popham have concluded: students’ test scores are too unstable a basis for evaluating teachers to be considered in isolation from other measures, and they do not allow for deep understanding of the context in which students are taught and tested.

Distinguishing between student learning and student achievement, teacher and blogger Renee Moore writes about the MET Project: “Standardized tests are the instruments we use (for now) to measure student achievement, but there is much, much more that we need to know about measuring student achievement and student learning.”

In a recent paper, Renee and I pointed out that standardized tests, on which value-added methods are based, often assess only a small sliver of the content that teachers teach and students learn. They do not measure a student’s ability to think deeply or creatively. In addition, most teachers teach students whose academic proficiencies span many grade levels, yet the tests they are judged by assess only a narrow band of content. As a result, even value-added tools cannot adequately adjust for the English Language Arts teacher who teaches seventh graders whose reading skills range from third to twelfth grade, when the standardized test they take has mostly seventh-grade items. The underlying assessments the VAMs are based on simply lack sufficient “stretch.”

How Can We Best Implement This Research?
MET researchers recommend that between 33 and 50 percent of a teacher’s evaluation should be tied to value-added methods. What does a quick look at the research mean for states, districts, and unions trying to make research-driven decisions? First, it suggests that they are on the right track to insist on getting many “snapshots” of a teacher’s practice, both in the frequency of observations and in the kinds of data incorporated. However, no single view into a classroom should be weighted too heavily if we want useful and accurate information about teaching.

Finally, MET authors aren’t pushing toward a magic number for value-added weights, giving flexibility to negotiate weights within a range that feels comfortable locally.
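To make the weighting idea concrete, here is a minimal sketch of how a composite score might be computed. It is an illustration only, not the MET Project's actual method: the measure names, the teacher's scores, the 0-100 scale, and both sets of weights are hypothetical, chosen so that the value-added weight sits at the low and high ends of the 33 to 50 percent range mentioned above.

```python
def composite_score(scores, weights):
    """Return the weighted average of a teacher's measure scores.

    Assumes every measure has already been placed on a common scale
    (here, a hypothetical 0-100 scale) and that the weights sum to 1.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same measures")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[m] * weights[m] for m in scores)


# Hypothetical teacher: weaker on value-added, stronger on observations.
scores = {"value_added": 60.0, "observation": 80.0, "student_survey": 70.0}

# Hypothetical weights with value-added at roughly 33 percent...
weights_low = {"value_added": 0.33, "observation": 0.34, "student_survey": 0.33}
# ...and at 50 percent, with the remainder split evenly.
weights_high = {"value_added": 0.50, "observation": 0.25, "student_survey": 0.25}

low = composite_score(scores, weights_low)
high = composite_score(scores, weights_high)
print(round(low, 1), round(high, 1))
```

For this made-up teacher, the same three scores yield a lower composite when value-added counts for half of the total, which is why the choice of weights within the range is worth negotiating carefully.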

The Gates Foundation is promoting the idea that states and school districts will use the research to create evaluation systems to help teachers improve, and their research does indeed support the need for policymakers to build evaluation systems that use multiple measures in thoughtful ways. Incidentally, a thoughtful balance of multiple measures is what, in our experience, teachers recommend as the best way to structure evaluations. (See full recommendations from our teacher teams in Denver [PDF], Hillsborough County [Sliderocket presentation], Illinois [PDF], and Washington State [PDF].)

Asking the Right Questions
However, even the best research isn’t without challenges, and it isn’t always easy to act on. State and local decision makers, as well as teachers who will be affected, need to ask the right questions as they translate research into implementation. Here are a few:

  • With all this information, do we need to know anything else as we make decisions about weighting? The research team is highlighting the fact that teachers were reassigned to students in the second year of the study, in an attempt to get true experimental results. Yet some education economists, like Jesse Rothstein and Doug Harris, have pointed to sources of potential bias that are explained in the study’s methodology. Researchers couldn’t make random assignment of students foolproof, and had limited comparisons of teachers across schools. Therefore, the results are still read best in the context of all teaching effectiveness research, not as a silver bullet prescription.
  • How ironclad is MET’s suggested weight range for value-added? It’s probably a good starting point…with a lot of “ifs.” Other researchers point to the fact that the MET Project used test scores to measure effective teaching, and then claimed test scores should be used to determine who is a good teacher or not. RAND authors noted that teaching is “multidimensional” and “choosing weights is unlikely to be as straightforward” as policymakers and pundits might wish, even with this guidance. And of course, student test results—and the value-added scores and decisions based on them—are only as good as those underlying assessments.
  • Why would student assessments matter for the results we’ll get? Study authors found that among the six assessments they used to measure student achievement growth, those that focused on higher-order skills were least connected with value-added scores.
  • Do factors other than weighting matter? Process and understandability might. Scholars at the RAND Corporation offer a technical report [PDF] alongside the study results. They point out that composite scores have been used widely in studies on health care, water quality, and managerial performance. But they warn that if these measures are incorporated back into a single index score of performance, it “may invite misleading and simplistic policy conclusions” if the “process of constructing them is not transparent.”

To be sure, growing numbers of teachers, in CTQ’s network and in others, are looking for results-oriented evaluation systems that respect the complexity of their work and help them improve their practices in ways that benefit their students.

And on that note, I want to end with the wise words of Renee Moore:

Just because statisticians develop value-added estimates of teacher effects, using this or any other data, does not mean they should be the arbiters of who is effective or not…. I would suggest that one important component of new student assessments is that the results be given not just to the individual teachers of those students, but that teachers be involved in the interpretation and discussion of test data together, in various configurations, before it’s released to anyone else. The ability to examine data deeply in collaborative settings will take learning and skill, for which teachers should be prepared and compensated. This type of data interpretation would be, in fact, a form of de facto peer evaluation and true job-embedded professional development.
