A theory of change is a wonderful instrument to explore the "why" and "how" of an intervention. So why do evaluations make such patchy use of theories of change? Often, it is because evaluation questions ask mainly "how much". This blog post narrates how I have come to this conclusion.
Why do many evaluation reports yield only weak insights? Having worked in all three corners of the evaluation triangle – as an evaluator, as an evaluation commissioner / manager, and as a stakeholder in interventions under evaluation (the evaluands) – I find that we can only put part of the blame on evaluation teams. Often, evaluations come with high expectations which low budgets and narrow timeframes cannot fulfil. If, on top of that, evaluations are poorly prepared, evaluation teams may find themselves struggling with scope creep and shifting goalposts. They will spend much of their time trying to understand the evaluand and negotiating the evaluation scope with the client, wasting time that should be spent on proper data collection and analysis. Better preparation and accompaniment of evaluations could make a big difference. Ideally, that should happen as part of an evaluability assessment and before the evaluation terms of reference (TOR) are finalised.
Howard White, a specialist in evaluation synthesis, has posted a list of 10 common flaws in evaluations. There are other flaws one could find. But I would propose to reflect on solutions that all parts of the evaluation triangle can contribute to. Evaluations work best when evaluators, evaluation managers and those who represent the evaluand work together as critical partners. In recent years, I have supported organisations in their evaluation management, so this post focuses on things that evaluation managers can do to prevent Howard's "10 flaws" (in italics below) from happening. Let's look at them one by one!
1. Inadequate description of the intervention: Ideally, all evaluation reports start with the description of the evaluand. If the evaluand is one project implemented by one organisation in one country, it shouldn’t be too hard to fit that within a couple of pages. If it is a collection of programmes encompassing cascades of diverse activities by hundreds of organisations around the world, evaluators need to be a bit more abstract in their introductory description. But obviously they need to understand the evaluand to design the appropriate evaluation!
Evaluation managers can map the components of the programme, review its theory of change, and organise the documentation so that evaluation teams can make sense of it. This is particularly important if the evaluand is too complicated to be adequately described in the TOR. A good example from my practice was a portfolio evaluation: Before commissioning the evaluation, evaluation management developed a database listing key features of all projects in the portfolio. That made it easy to understand and describe the evaluand, and to select key cases for deeper review. Conversely, in a different assignment, my team spent (unplanned) months trying to make sense of the – sometimes contradictory – documentation and verbal descriptions of the sprawling evaluand.
2. ‘Evaluation reports’ which are monitoring not evaluation: Evaluation managers can prevent this problem by formulating appropriate evaluation questions. Often, evaluation questions start with “to what extent…”, followed by rather specific questions about the achievement of certain results. Those kinds of questions risk limiting the evaluation to a process monitoring exercise, or some kind of activity audit. For programme learning, it is useful to ask questions starting with “why” and “how”.
3. Data collection is not a method: Evaluation managers can make sure the TOR requests evaluators to describe the approaches and methods they use in the evaluation, for data collection and for analysis respectively. They can look for gaps in the inception report, ideally checking the annexes as well, to find out whether the proposed instruments match the proposed methodology. That takes some specialist knowledge – ideally, evaluation managers should have substantial first-hand evaluation experience or a background in applied research.
4. Unsubstantiated evidence claims: Evaluation managers can invite evaluation teams to structure their reports clearly, so that each finding is presented with the supporting evidence. Many evaluations I have seen weave their findings and related evidence so closely together that it is hard to tell them apart – a style that is often described as “overly descriptive”. Obfuscating the boundaries between findings and evidence can be a strategy to hide findings about gaps and failure in programmes. Where programme teams are hostile to challenging findings, evaluation managers can play a role in defending the evaluation team’s independence, and their mission to support learning from success and from failure.
5. Insufficient evidence: The amount of evidence an evaluation team can generate depends to a great extent on the time and other resources they have. One important role of evaluation managers is to ensure a good balance between expectations from the evaluation and resources for the evaluation. If an organisation expects an evaluation to answer, say, 30 complex questions on an evaluand encompassing tens of thousands of diverse interventions in diverse contexts within half a year, it must be prepared to live with evidence gaps.
6. Positive bias in process evaluations: Positive bias can arise from poor evaluation design (see also points 2, 3 and 4 above). It can also be linked to evidence gaps (see point 5 above) – when in doubt, evaluators hesitate to pass “negative judgements”. But often, positive bias slips in near the end of the evaluation process, when programme managers object to findings about gaps, mishaps, or failure in their programme. That takes us back to the role of evaluation managers in fostering commitment to learning from failure.
7. Limited perspectives: Who do evaluators speak to? This problem is related to issues 1, 5 and 6 above. Where resources for an evaluation are limited, fieldwork might be absent or restricted to the most accessible areas (when I worked in China, they called such places “fields by the road”, always nicely groomed). When working on a shoestring, evaluators will struggle to sample, or to select cases, purposefully. But they can still speak to people representing different perspectives. Evaluation managers can encourage that, by mapping stakeholders in the TOR and explicitly asking for interviews with people who are underrepresented.
8. Ignoring the role of others: If most evaluation questions focus on programme performance, evaluators will focus on programme performance. Often, evaluation TOR address the role of others only in a brief question related to the coherence (OECD-DAC) criterion. But questions about effectiveness, impact and sustainability can also be framed to encourage evaluators to look at the influence of other “actors and factors”.
Also, ideally, programmes should be built on preliminary context and stakeholder analyses, which should be continuously updated. Where that has happened, that information should flow into the TOR’s context section.
9. Causal claims based on monitoring data: Good monitoring data can be a helpful ingredient in an evaluation that triangulates data from different sources. There is no reason to believe people fake their monitoring data. It is just that most of the time, the amount and quality of monitoring data are inadequate. Monitoring and evaluation specialists can make sure each programme has a monitoring system that produces data which are useful for monitoring and for evaluation. Furthermore, evaluation TOR should remind evaluators of the need to triangulate data, i.e., to compare data sourced from different perspectives via different data collection tools.
Howard mentions a separate point under the “9th flaw”, the attribution problem: “Outcome evaluations present data on outcomes in the project area and claim that any observed changes are the result of the project.” But evaluators are not going to solve that problem by collecting data from a greater variety of perspectives. They need to be encouraged to look beyond the evaluand as a likely cause of the desired effects – see point 8 above.
10. Global claims based on single studies: As pointed out by Howard, lessons from a specific evaluation are only relevant for the intervention being evaluated. That is something that everyone in the evaluation triangle needs to be aware of. Evaluation managers are well placed to remind decision-makers in their organisations of the fact that an evaluation is about the evaluand only. It can feed into a broader body of evidence, but it should never be the only basis for decision-making beyond the context of the evaluand.
We have reached the end of the list, but there is so much more that can go wrong in evaluations. Investing in good preparation, and, once the evaluation team is recruited, building rapport and effective communication between evaluation managers, programme implementers and evaluators, are essential for risk management in evaluations.
This year again, I feel privileged to serve on a panel of senior evaluators who advise a multilateral donor on evaluation approaches and methods. And this year again, I feel saddened by the widespread neglect of qualitative data collection. All evaluations I have reviewed (cumulatively, I have reviewed hundreds…) include at least some elements of qualitative data collection – key informant interviews (KIIs), for example, or focus group discussions (FGDs). Even in (quasi-) experimental setups that rely on large standardised surveys, qualitative data are used to build questionnaires that resonate with the respondents, or to deepen insights on survey findings.
We need good data for good evaluations. Too often, the KII and FGD guides I see appended to evaluation reports are not likely to elicit good data: They are worded in abstract language (some evaluators don’t even seem to bother translating highly technical evaluation questions into questions that their interlocutors can relate to), and they contain way too many questions. I have seen an interview guide listing more than 50 questions for 60-90-minute interviews. That won’t work. An FGD guide with 20 questions for a 2-hour discussion with 12 persons won’t work, either. You can gather answers to 20 questions within two hours, but they will come from just one or two participants and there won’t be any meaningful discussion. Discussion is the whole point of an FGD – you want to hear different voices!
In my practice, I like to work with smaller focus groups – about 3-8 persons – and I count about 1-3 questions per hour, plus time for a careful introduction. The questions should be phrased in a way that makes it easy to discuss them – avoid jargon, because jargon spawns jargon, which is often hard to interpret. The Better Evaluation library provides a helpful video that explains key principles of FGDs, even though I would be careful about mixing women and men in some settings. In international cooperation, it has become common practice to organise separate focus groups with female and male participants respectively, to prevent male voices from dominating, and to surface issues that people don’t like to discuss in front of representatives of other genders. You also need to consider other aspects of participants’ identity – social class, for example – to obtain reasonably homogenous focus groups. And you could try to find a way of collecting data from people who don’t identify as female or male, especially when you wish to work in a fully gender-responsive (or feminist) manner. (Have a look at this week’s posts on the American Evaluation Association tip-a-day newsletter celebrating pride week!)
Back in 2019, I published a blog post on what I called classism in data collection – a widespread trend in international evaluations to hold KIIs with powerful people only, and to lump those who are supposed to ultimately draw some benefit from the evaluated project into large FGDs. I’ll repost the blog soon because I see this issue over and over again, and it is not only an inequitable practice, it also yields shoddy data. Watch this space!