Some thoughts on Taylor and Watson’s (2015) RCT on the impact of study guides on school-leaving results in South Africa

Since 2010 most of my time spent on academic research has focused on two particular areas:

  1. The use of randomised controlled trials (RCTs) to support inappropriate, or overly strong, policy claims or recommendations
  2. Empirical examples of how this has manifested in the economics of education.

I was therefore somewhat frustrated to attend a presentation at the Economic Society of South Africa conference in 2013 and find some rather strong policy claims being made on the basis of very weak evidence (even by the standards of practitioners who favour RCTs). I raised my concerns with the relevant author, but the recently published working paper contains the same problems.

It therefore seems appropriate to summarise my concerns with this work: partly so that interested parties can understand its flaws, but mainly to provide an illustration of how the new fad for RCT-based policy is often oversold.[1] That’s important because, despite seemingly ample evidence, economists often tell me: “Oh, but no-one really uses RCT results in that way”.

Short summary of the paper

The authors randomly allocated exam study guides to Grade 12 pupils in a sample of schools in Mpumalanga province (South Africa), for four different subjects: accounting, economics, geography and life sciences. Of the sample, 79 schools received the guides and 239 did not. These guides were already available online, but the explicit assumption is that the schools in question would not have known that or been able to access such online resources.

The study focuses on students’ final National Senior Certificate (‘matric’) exam scores as the outcome of interest. The authors find no statistically significant difference in outcomes for two subjects, but that in the other two subjects students scored two percentage points higher on average if they received study guides. This conclusion is reached by regressing school-average NSC scores, for each subject, on a binary variable indicating whether the school was selected to receive the study guides.
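
For readers who want to see the shape of that specification, here is a minimal sketch in Python. The file and column names are hypothetical, and the paper’s own models include controls (e.g. baseline results) that I omit here:

```python
# Minimal sketch of a school-level specification of the kind described above.
# Input file and column names are hypothetical; this is not the authors' code.
import pandas as pd
import statsmodels.formula.api as smf

schools = pd.read_csv("school_subject_averages.csv")  # one row per school-subject pair

for subject in ["accounting", "economics", "geography", "life_sciences"]:
    sub = schools[schools["subject"] == subject]
    # School-average NSC score regressed on the treatment dummy, weighted by
    # the number of candidates who wrote that subject (as the paper describes).
    fit = smf.wls("mean_nsc_score ~ treated", data=sub,
                  weights=sub["n_candidates"]).fit()
    print(subject, round(fit.params["treated"], 2), round(fit.pvalues["treated"], 3))
```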

Based on this finding and a rough ‘cost-benefit analysis’ the authors conclude that: “distributing the geography and life science [study guides] at scale… is amongst the most cost-effective of educational interventions internationally that have been tested using randomised experiments”. [For readers unfamiliar with this kind of research: that’s a very strong claim!]

In a number of basic respects the paper can be considered a local imitation of the paper by Glewwe, Kremer and Moulin (AEJ: Applied Economics, 2009) on the impact of randomly assigning textbooks to schools in Kenya.

Problem #1: Fishing for statistical significance?

The first major problem with the paper arises from the fact that its hypothesis is very crude: do study guides improve NSC scores? The hypothesis has little contextual foundation and no subject-specific rationale. Such ‘atheoretic’ approaches have received quite scathing critiques (from Cartwright, Deaton, Heckman and Keane, among others) and I will not repeat those here. Suffice to say that even if we agree it is not entirely pointless to conduct such an experiment, the lack of a richer conceptual (theoretical) basis becomes a problem when the authors want to depart from this simple hypothesis.

In this case the appropriate answer to the basic hypothesis would appear to be that the study does not provide statistically significant evidence that study guides improve performance; I say ‘appear’ because the paper does not report the pooled result. That, however, is not the conclusion the authors draw. Instead they argue (having already seen the evidence) that there are really four hypotheses we should test:

  1. Do study guides improve matric exam scores for economics?
  2. Do study guides improve matric exam scores for accounting?
  3. Do study guides improve matric exam scores for life sciences?
  4. Do study guides improve matric exam scores for geography?

They then treat the findings for each subject as separate from the other subjects. This allows them to conclude that there is a statistically significant, positive and sizeable effect of sending study guides to schools for two subjects: geography and life sciences.

While there is a genuine methodological issue one could debate here (see one recent discussion here), pursued in the manner of this paper it amounts, I would argue, to inappropriate data mining: authors who do not find significant main results start disaggregating their samples or treatments until they find a significant one.[2]

To be maximally fair to Taylor and Watson, much higher-profile authors have engaged in similar practices in the past (albeit hidden behind more technical sophistication) and seemingly got away with it. I’m pleased to say, though, that the proverbial worm has been turning over the last decade, and this kind of approach to multiple hypothesis testing, and to treatment heterogeneity analysis, is increasingly recognised as dubious (even by J-PAL).

The apparent good news for the authors in this case is that even with a simple Bonferroni adjustment (often viewed as conservative), the two findings remain significant at the 5% level (though not at the 1% level reported).
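
For concreteness, this is all a Bonferroni adjustment involves: compare each of the four subject-specific p-values against α/4 (equivalently, multiply each p-value by four). The p-values below are placeholders, not the paper’s reported values:

```python
# Bonferroni adjustment across the four subject-specific hypotheses.
# These p-values are illustrative placeholders, not the paper's estimates.
from statsmodels.stats.multitest import multipletests

pvals = {"accounting": 0.60, "economics": 0.45,
         "geography": 0.004, "life_sciences": 0.008}

reject, p_adj, _, _ = multipletests(list(pvals.values()),
                                    alpha=0.05, method="bonferroni")
for subject, p_raw, p_new, sig in zip(pvals, pvals.values(), p_adj, reject):
    print(f"{subject}: raw p = {p_raw}, adjusted p = {p_new:.3f}, reject at 5%: {sig}")
```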

Having said that, I am sceptical about the main results reported because they are based on school-averaged scores “weighted by the number of pupils that wrote that specific subject”. As far as I know this is an unusual way to analyse a school-level treatment assignment. I’d be interested to see the results from a student-level specification, as in Glewwe et al, with a single treatment indicator and standard errors clustered at the school level.
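
Something like the following is what I have in mind (a sketch only, again with hypothetical file and column names): one observation per pupil, with standard errors clustered by school, the unit of randomisation.

```python
# Student-level alternative: one row per pupil, treatment dummy defined at the
# school level, standard errors clustered by school. Names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

pupils = pd.read_csv("pupil_subject_results.csv")

geo = pupils[pupils["subject"] == "geography"]  # repeat per subject
fit = smf.ols("nsc_score ~ treated", data=geo).fit(
    cov_type="cluster", cov_kwds={"groups": geo["school_id"]})
print(fit.summary())
```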

Problem #2: Inappropriate extrapolation (external validity problems)

The second major problem I raised with the author was external validity. The paper now throws in this term a few times, but with little substantive engagement with the problem.[3]

In my PhD research, also published here and here, I argue that there is currently no rigorous basis for extrapolating results from RCTs. That being the case, basing a cost-benefit analysis and policy recommendations on the assumption that all schools in the country will experience results similar to those of 79 schools in Mpumalanga is unjustifiable.

Moreover, while the authors note that they used various criteria to determine which schools were included in their final sampling frame (318 schools), they never report the total number of potentially eligible schools in the province. According to the Department’s Schools Masterlist for 2013, there were 738 secondary (or combined) schools in Mpumalanga. In other words, the study apparently excluded over half the province’s schools, and its discussion of extrapolation entirely ignores the possible implications of that.

Problem #3: What is being estimated?

The authors randomly assigned study guides to schools but have no information on actual usage of the guides, at either the school or pupil level. Later in the paper (though not in the results section) they note that this means they estimate an ‘intent-to-treat’ (ITT) effect, not the effect of study guide usage itself.

The correct technical approach, if the relevant data were available, would be to treat assignment as an instrumental variable. In that case, however, the aim is to estimate a local average treatment effect (LATE), which is a function of the presence of ‘compliers’ (students who use the guides if their school gets them and don’t use them if it doesn’t) and ‘non-compliers’ (students who use, or don’t use, the guides regardless of whether their school gets them).
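
To make the ITT/LATE distinction concrete, here is a back-of-the-envelope calculation with entirely made-up numbers (the study collected no usage data, so nothing like this can actually be computed from it):

```python
# Wald/IV logic with hypothetical figures: under the usual instrument
# assumptions, LATE = (effect of assignment on scores) / (effect of
# assignment on usage). None of these numbers come from the study.
itt_effect = 2.0        # percentage-point difference in mean scores by assignment
usage_treated = 0.25    # hypothetical share of pupils in treated schools who used a guide
usage_control = 0.00    # hypothetical usage rate in control schools

late = itt_effect / (usage_treated - usage_control)
print(f"ITT = {itt_effect} pp; implied LATE for compliers = {late} pp")
```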

One problem with LATEs is that they are likely to be even less generalisable than sample average treatment effects, because the proportion of compliers and non-compliers may vary across populations. The authors appear largely unaware of this issue. Instead, they argue that the ITT effect is likely to be smaller than the treatment effect on students who actually use the guides, and that the likely benefits of an expanded programme are therefore larger.[4] Unfortunately, there is no evidence from the study itself to support that claim.

Furthermore, the authors state that “one can confidently rule out” the possibility that the observed treatment effect is due to ‘experimenter effects’ (Hawthorne and John Henry effects). However, they seem not to appreciate that any effect unrelated to study guide usage would fall into that category. For example, students might work harder simply because they feel privileged relative to a neighbouring school that did not receive the guides. The authors therefore cannot, in my view, rule out John Henry or Hawthorne effects, and this is also relevant to extrapolation.

Given all the issues above, I would be very concerned if this kind of approach to evaluation and policy formulation took hold in the South African education sector. While I appreciate the authors’ good intentions — to contribute evidence to improve basic education in South Africa — I don’t think the paper actually adds much to our knowledge base. RCTs may have some uses, but they continue to be widely oversold.

 

[1] There are various other objections from structural econometricians to the standard approach to RCTs, but it is not necessary to discuss those here.

[2] I say ‘inappropriate data mining’ because data mining is increasingly recognised by econometricians as not being a problem in itself, but rather as a problem when it is not done according to well-founded principles and its results are not assessed accordingly. This issue has also been one motivation for pre-analysis plans; more on those in later posts.

[3] The authors do not reference any sources for their discussions of external validity. Given that the paper emphasises John Henry and Hawthorne effects, I suspect their preferred source of information on external validity is probably the ‘toolkit’ provided by Duflo, Glennerster and Kremer (2006). Having said that: I have observed a disturbing trend of some academic peers not citing critical methodological work that is inconvenient – more on that in future posts.

[4] Structural econometricians would argue that one should also think about Marginal Treatment Effects – see the work by Heckman and Vytlacil – but that’s also a subject for later posts.

 

This work is written in my personal capacity and licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Author: peripheral economist

Academic, extra-mural public servant

5 thoughts on “Some thoughts on Taylor and Watson’s (2015) RCT on the impact of study guides on school-leaving results in South Africa”

  1. A few thoughts: initially I was quite skeptical of the result, since it seems to suggest statistically significant improvements in some subjects from what I would consider a very light-touch intervention (especially relative to other matric-type interventions). That said, I don’t doubt the result or that it was most plausibly due to the study guides. It is still quite a small effect, and only in a few subjects. However, and this is a big however, I think that when all you have is a hammer everything looks like a nail. Given that your PhD was on this topic, you are much more likely to see where your work parallels (and critiques) this research. But if your PhD had happened to be on the absolute dearth of any evidence whatsoever that goes into policy making, or on the possibilities for applying randomized techniques to learn something (rather than absolutely nothing) when distributing a limited number of something, or perhaps on how policy-making can improve incrementally from a low base – if any of those were the topic of your research, I imagine you would be praising this study to no end. Knowing the circumstances (it’s difficult to convince government officials to do even semi-rigorous research) and one of the researchers (who has specifically gone into government to try and improve the system, rather than critiquing from the outside, as many of us do, including yourself until recently), I don’t think we should pay massive amounts of attention to the finding, but the principle is very important: learning some information from projects by using randomized methods, where no randomization would have told us sweet nothing. I think this is an overly harsh treatment of a nascent development in government. As I said, when all you have is a hammer, all you can see is a nail rather than the many other (good) things around the nail.

    1. Thanks for the comment, Nic. As a start, let me say that my intention with this blog is to approach issues as a professional economist, so I’m not going to get drawn into personalised disagreements.

      For the record: I actually worked in national government for a year in 2004 and was on a civil service career track, so for that and other reasons I had fairly good, first-hand experience of the policy and implementation process at that level rather a long time ago.
      As for the hammer and nail analogy: actually this *is* a nail (RCT study making inappropriate claims about extrapolation and policy), and I make no apologies about using a rather effective hammer on it.

      As regards your broad suggestion that this kind of research should be judged sympathetically in a particular context, I’m afraid I entirely disagree. Making inappropriately strong claims based on weak RCT evidence does not promote the cause of evidence-based policy making – or at least not a version that is likely to improve policy. I’m sure I’ll have more to say about that in future posts.

      On the actual issues at hand: I raise a host of serious technical concerns in the post, and your reply does not attempt to address any of them, so there is not much more for me to say.

  2. Although I’m somewhat reluctant to get drawn into a big debate, let me make a few points about our paper in response to some of your critique.

    Mainly, I think that what we would openly concede as limitations of our study, you regard as serious flaws. In view of the point made by Nic (above), namely that the context is one in which there is very little impact evaluation of our education policies or programmes, our use of an RCT (a method with considerable advantages but also with limitations) makes a useful contribution. Here was a new resource that had not yet been distributed at scale; the department was willing to roll out the initial run of study guides so as to allow for an evaluation; and we would already have good outcomes data through the NSC exams. It was a great opportunity to conduct an RCT at an unusually low cost and in this way obtain causal estimates of programme impact – something that we just about never do. Of course there would be limitations to the study – e.g. we did not collect data on who actually used the books, and findings would not be generalizable to a different population in a statistical sense. But this external validity constraint is true for any sample or study site irrespective of the research method – not only for the method that actually produces solid estimates of causal impact.

    “Problem #1”: Do we “fish for statistical significance”? Firstly, let’s not forget that previous quantitative work in South Africa (and in the international literature) has often used education production function methods applied to observational datasets, such as TIMSS. It is much easier to go on a fishing expedition in that way – explore about 40 different covariates and then settle on a subset, perhaps including a largish coefficient on textbook access. The finding then becomes about textbooks, but it could easily have been about teacher absence. RCTs go a long way towards mitigating fishing for statistical significance because you are investigating one treatment of interest. With RCTs you are thus much more likely to publish a zero effect, as we do for two of the study guides. With four different treatments (different subject-specific study guides) applied to four different groups of students (depending on subject selection), with different outcome measures (different exams for each subject), it always struck us that there were four separate treatment effects to estimate, as opposed to pooling the data and measuring a single effect. You are also sceptical of the school-level model. The reason for using it is that it allows the inclusion of a baseline measure (we can only observe the same schools in the previous year, not the same students). But we did in fact present a student-level model (table 6), where the same basic result is obtained – effects for GEOG and LFSC but not for ACCN and ECON. We haven’t pooled the subjects (as you suggest), but I suspect we might still find a positive effect, since the subjects with positive effects (GEOG and LFSC) are also the subjects taken by larger numbers of students. So doing it that way might actually have been more selective on our part. What we should have done, however, is publish a pre-analysis plan specifying the models we would estimate, so as to protect us from this sort of allegation. In other RCTs I am involved in we are now doing this, since it has become a widely used practice in recent years. Your suggestion of using the Bonferroni adjustment is useful – although, as you established, it doesn’t change the basic result.

    “Problem #2”: Do we make inappropriate extrapolations or overly bold policy claims? You point out that we don’t use all the high schools in Mpumalanga, making it sound a bit suspicious. But we clearly explain the criteria for inclusion in the sampling frame – only quintile 1-3 schools in which Afrikaans is not the language of instruction, and we excluded schools where these 4 subjects were not all offered. In fact the use of quintile 1-3 schools in a largely rural and poor province was done precisely to improve the relevance of the findings for the section of the school system in South Africa that is most urgently in need of policy attention and improvement. This shows our strong awareness of external validity considerations: instead of giving up on doing an impact evaluation because, strictly speaking, findings don’t necessarily apply to a different population, we picked our study site with the wider and specific population of policy interest in mind. Had we worked in the Western Cape or Gauteng, we would have been more vulnerable to the critique that what works in those provinces may not work elsewhere. Another external validity strength of our study is that we avoid the special experimental conditions which are often a threat to external validity in RCTs – we don’t have an NGO providing close attention to treatment schools, nor did we administer our own tests. Thus it is likely that individual students would not even have been aware that they were part of an experiment.

    Next, the policy claims about cost-effectiveness and about the possible impact on the overall NSC pass rate were made on the basis of clearly explained assumptions and calculations. You quote us saying that our intervention is amongst the most cost-effective in the world, and then assist your readers by explaining that this is a bold claim. But we did not simply claim that. We replicated a cost-effectiveness calculation (using actual unit costs and test score gains) done by Kremer, Brannen and Glennerster, and on that measure we ranked about 5 out of 15 studies. We were simply presenting that calculation. Similarly, when we do a simulation to estimate the possible impact on NSC results, we are perfectly transparent about the assumptions and calculations made: e.g. if we assume that a similar benefit of 2 percentage points accrued to all students taking geography and life sciences… etc. We are not unaware of, or trying to hide, the fact that we can’t be sure that the same Mpumalanga effect would hold elsewhere.

    By the way, I think we’d be in agreement about the dangers of relying on one or two famous studies to tell us whether something works, e.g. the finding on textbooks from the Glewwe et al Kenya study. Context does matter, and actually we make this point in our study: we had four different experiments on study guides in which the study site was the same and implementation was identical. Yet, 2 guides made an impact and 2 did not. Context, which in this case includes content of materials and the nature of the subject and style of learning, matters.

    “Problem #3: What is being estimated?” Our point about the effect of the study guides on those who used them was brief and simple: if our observed average treatment effect was 2 percentage points but only 25% of students in treatment schools actually used the guides, then the effect on those students was no doubt larger than 2 percentage points. We didn’t have data on which individuals actually used the study guides, so estimating a LATE using treatment assignment as an instrument was out of the question. Had we been able to use that method, then perhaps we would also have acknowledged the limits of the LATE. But really, to suggest that we were unaware of the limits of a method we hadn’t even used in our paper is a bit unfair.

    We know the limits of RCTs. I am aware of some of the arguments made by those such as Deaton. But I don’t think we should throw the baby out with the bathwater. We should use RCTs when appropriate and improve our study designs to mitigate some of the challenges, most notably the artificial experimental conditions and difficulties in extrapolating findings.

    Another contextual factor to bear in mind about this discussion is that you and I might be speaking to slightly different audiences and problems. You are engaging with the debate in development economics where there is already a keen appreciation for issues around identifying valid causal estimates. In this field lots of RCTs and quasi-experiments have been done; critique of RCTs is therefore a part of taking the field forward. I am engaging in an education policy space in which there is far less appreciation of these issues and where we really need to put some quantitative evidence on the table. In my context, people show the Minister a statistic showing that schools with libraries perform exponentially better than schools without libraries, and the message is that we need to build libraries.

    I’d be interested to know: what positive approach to impact evaluation in the education sector would you propose? More and better qualitative research? RCTs done on a national sample? Correlational studies? Waiting for natural experiments?

  3. Hi Stephen, thanks for the detailed and focused response.
    Though I have other reservations about pre-analysis plans, I do agree that they largely address hard-to-resolve issues around researcher motives. The only thing I’ll add in this case is that the general approach of your reply is rather different from my recollection of ESSA (2013). For example, the rationale about Mpumalanga being chosen for external validity reasons appears in the recent version but not the 2013 version. I won’t dwell further on this point. More broadly, I would suggest that there is enough work on external validity – including, yes, my own – that arguments about it should be based on some referenced source(s).
    On the specification issue: I don’t really find the argument for the school-level specification persuasive. The standard approach in the literature (e.g. Glewwe et al) is to pool at the student level and use clustered s.e.’s, so it would seem appropriate to report those results somewhere. That doesn’t prevent one using the baselines in another specification to test the robustness of the results. What I suggested was that the school-specification results may be robust to a Bonferroni adjustment, but I suspect the student-level results may not be.
    I wasn’t suggesting that the failure to discuss the implications of the reduced school sample was suspicious, but rather that if one has a selected sample within a province, it would be appropriate to consider the implications of that when proposing or discussing scale-up – the paper doesn’t do that.
    I don’t think the way the claims have been presented, either at ESSA or in the paper, communicates to an average reader the extent of the unreliability of the results for informing an expanded intervention. My conclusion was that I’m not even convinced about the internal validity of the results (for specification and multiple hypothesis testing reasons), and even less so about external validity. I’d be interested to know how the results are pitched to policymakers. Dare I say, my experience (with J-PAL, 3ie, etc.) is that uncertainty is not a favoured point of emphasis.
    The point about the LATE is as much about the fact that one cannot entirely isolate individual weaknesses of the paper. One cannot separate the concern about multiple hypothesis testing and specification from the concern about not having data on usage and the statements about the ITT effect being smaller than the ATT. In my view these pile weak claims on top of each other. E.g. the scale-up discussion suggests that ITT < ATT, but that assumes the results are driven by a causal effect of usage, and absent any evidence of that there strictly isn’t a basis for the claim. Here’s a deliberately outlandish alternative story that you can’t refute: maybe the schools sold the books and bought biscuits for the students with the money; that increased motivation, improving scores in ‘softer’ subjects but not in subjects which required stronger prior foundations. One thing we can definitely agree on is that ill-informed fishing expeditions with observational data are problematic. Perhaps you are gesturing at this paper: http://onlinelibrary.wiley.com/doi/10.1111/saje.12091/suppinfo
    Various people raised concerns when it was presented at UCT, but unfortunately the authors have not softened the policy claims much. I agree that little or no weight should be put on those findings per se.

    On your final question: I have some detailed thoughts about alternative approaches to policy-making but need to find the opportunity to flesh them out. The short version: institution building – particularly in developing countries – is more of a practical challenge than an intellectual one, and it is not primarily going to be ‘fixed’ by better data or research. Not a lot of researchers or grant funders want to hear that, though. I haven’t been able to engage with their work much, but the views expressed by Pritchett, Woolcock and (recently) Hausmann seem closest to mine. See here for a summary:

    https://www.project-syndicate.org/commentary/evidence-based-policy-problems-by-ricardo-hausmann-2016-02

    More on what researchers should do in later posts…
