Since 2010, most of my academic research has focused on two areas:
- The use of randomised controlled trials (RCTs) to support inappropriate, or overly strong, policy claims or recommendations
- Empirical examples of how this has manifested in the economics of education.
I was therefore somewhat frustrated to attend a presentation at the Economic Society of South Africa conference in 2013 and find some rather strong policy claims being made on the basis of very weak evidence (even by the standards of practitioners favouring RCTs). I raised my concerns with the relevant author, but I see that the recently published working paper contains the same problems.
It therefore seems appropriate to summarise my concerns with this work: partly so that interested parties can understand its flaws, but mainly to illustrate how the new fad for RCT-based policy is often oversold. That’s important because, despite seemingly ample evidence to the contrary, economists often tell me: “Oh, but no-one really uses RCT results in that way”.
Short summary of the paper
The authors randomly allocated exam study guides to Grade 12 pupils in a sample of schools in Mpumalanga province (South Africa), for four different subjects: accounting, economics, geography and life sciences. Of the sample, 79 schools received the guides and 239 did not. These guides were already available online, but the explicit assumption is that the schools in question would not have known that or been able to access such online resources.
The study focuses on students’ final National Senior Certificate (‘matric’) exam scores as the outcome of interest. The authors find that in two subjects there was no statistically significant difference in outcomes, but that for two subjects students did score two percentage points higher on average if they received study guides. This conclusion is reached by regressing average school NSC scores on a binary variable representing whether a school was selected for study guide distribution or not.
Based on this finding and a rough ‘cost-benefit analysis’ the authors conclude that: “distributing the geography and life science [study guides] at scale… is amongst the most cost-effective of educational interventions internationally that have been tested using randomised experiments”. [For readers unfamiliar with this kind of research: that’s a very strong claim!]
In a number of basic respects the paper can be considered a local imitation of the paper by Glewwe, Kremer and Moulin (AER, 2009) on the impact of randomly assigning textbooks to schools in Kenya.
Problem #1: Fishing for statistical significance?
The first major problem with the paper arises from the fact that its hypothesis is very crude: do study guides improve NSC scores? The hypothesis itself has little contextual foundation and no subject-specific rationale either. Such ‘atheoretic’ approaches have been on the receiving end of quite scathing critiques (such as Cartwright, Deaton, Heckman and Keane among others) and I will not repeat those here. Suffice to say that even if we agree that it is not entirely pointless to conduct such an experiment, the lack of a richer conceptual (theoretical) basis for the experiment becomes a problem when authors want to depart from this simple hypothesis.
In this case the appropriate answer to the basic hypothesis would appear to be: the study does not provide statistically significant evidence that study guides improve performance. I say ‘appear’ because the paper never reports this pooled result. Nor is it the conclusion the authors draw. Instead, having already seen the evidence, they argue that there are really four hypotheses we should test:
- Do study guides improve matric exam scores for economics?
- Do study guides improve matric exam scores for accounting?
- Do study guides improve matric exam scores for life sciences?
- Do study guides improve matric exam scores for geography?
They then treat the findings for each subject as separate from the other subjects. This allows them to conclude that there is a statistically significant, positive and sizeable effect of sending study guides to schools for two subjects: geography and life sciences.
While there is a genuine methodological issue one could debate here (see one recent discussion here), when pursued in the manner of this paper I would argue that this constitutes inappropriate data mining: authors who don’t find significant main results start disaggregating their samples or treatments in order to find a significant result.
To be maximally fair to Taylor and Watson, much higher-profile authors have engaged in similar practices in the past (albeit hidden behind more technical sophistication) and seemingly gotten away with it. I’m pleased to say, though, that the proverbial worm has been turning over the last decade, and this kind of approach to multiple hypothesis testing, as well as to treatment heterogeneity analysis, is increasingly recognised as dubious (even by J-PAL).
The apparent good news for the authors is that even with a simple Bonferroni adjustment (often viewed as conservative), the two findings remain significant at the 5% level, though not at the 1% level the authors report.
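For readers unfamiliar with it, the Bonferroni adjustment simply multiplies each p-value by the number of hypotheses tested (equivalently, it divides the significance threshold by that number). A minimal sketch of the mechanics, using made-up p-values for illustration only (these are not the paper’s numbers):

```python
# Bonferroni correction: with m hypotheses, compare each p-value to
# alpha/m -- or, equivalently, multiply each p-value by m and compare
# it to alpha as usual.
m = 4  # four subject-specific hypotheses instead of one pooled test
alpha = 0.05

# Hypothetical p-values, purely for illustration (not from the paper)
p_values = {"accounting": 0.40, "economics": 0.30,
            "geography": 0.008, "life sciences": 0.012}

for subject, p in p_values.items():
    adjusted = min(p * m, 1.0)  # Bonferroni-adjusted p-value, capped at 1
    print(f"{subject}: raw p={p:.3f}, adjusted p={adjusted:.3f}, "
          f"significant at 5%: {adjusted < alpha}")
```

The point of the adjustment is that testing four hypotheses at the 5% level each gives a much higher chance of at least one false positive than a single test does; the correction restores the intended family-wise error rate.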
Having said that, I am sceptical about the main results because they are based on school-averaged scores “weighted by the number of pupils that wrote that specific subject”. As far as I know this is an unusual way to analyse a school-level randomisation. I’d be interested to see the results from a student-level specification, as in Glewwe et al., with a single treatment indicator and appropriate (clustered) standard errors.
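To make the school-level specification concrete: with a single binary regressor, a pupil-weighted least-squares regression of school mean scores on the treatment dummy reduces to the difference in pupil-weighted means between the two groups of schools. The sketch below uses simulated data (school counts match the study’s 79/239 split, but the scores, pupil numbers, and the assumed two-point effect are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated school-level data -- purely illustrative, not the study's data
n_schools = 318
treated = np.zeros(n_schools, dtype=bool)
treated[:79] = True                            # 79 schools received guides
n_pupils = rng.integers(20, 200, n_schools)    # pupils writing the subject
# Assume a true effect of +2 percentage points, plus school-level noise
school_mean = 45 + 2.0 * treated + rng.normal(0, 5, n_schools)

def weighted_mean(x, w):
    """Mean of x weighted by w."""
    return np.sum(w * x) / np.sum(w)

# WLS slope on a lone binary regressor = difference in weighted means
effect = (weighted_mean(school_mean[treated], n_pupils[treated])
          - weighted_mean(school_mean[~treated], n_pupils[~treated]))
print(f"Estimated treatment effect: {effect:.2f} percentage points")
```

A student-level regression with the same treatment indicator would instead use each pupil’s score as the outcome, with standard errors clustered at the school level to reflect the school-level assignment.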
Problem #2: Inappropriate extrapolation (external validity problems)
The second major problem I raised with the author was external validity. The paper now throws in this term a few times, but with little substantive engagement with the problem.
In my PhD research, also published here and here, I argue that there is currently no rigorous basis for extrapolating results from RCTs. That being the case, basing a cost-benefit analysis and policy recommendations on assuming that all schools in the country will experience similar results to 79 schools in Mpumalanga is unjustifiable.
Moreover, while the authors note that they used various criteria to determine which schools entered their final sampling frame (318 schools), they never report the total number of potentially eligible schools in the province. According to the Department’s Schools Masterlist for 2013, there were 738 secondary (or combined) schools in Mpumalanga. In other words, the study apparently excluded more than half the province’s schools, and its discussion of extrapolation entirely ignores the possible implications of that.
Problem #3: What is being estimated?
The authors randomly assigned study guides to schools but have no actual information on usage of these, either at the school or pupil level. Later in the paper (but not in the results section) the authors note that this means they estimate an ‘intent-to-treat’ (ITT) effect, not the actual effect of study guide usage.
The correct technical approach, if the relevant data were available, would be to treat assignment as an instrumental variable. In that case, however, the aim is to estimate a local average treatment effect (LATE), which is a function of the presence of ‘compliers’ (students who use the guides if their school gets them, and who don’t use them if their school doesn’t) and ‘non-compliers’ (students who use, or don’t use, the guides regardless of whether their school gets them).
One problem with LATEs is that they are likely to be even less generalisable than sample average treatment effects, because the proportion of compliers and non-compliers may vary across populations. The authors appear largely unaware of this issue. Instead, they argue that the ITT is likely to be smaller than the treatment effect on students who actually use the guides, and that the likely benefits of an expanded programme are therefore larger. Unfortunately, there is no actual evidence from the study to support that claim.
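The ITT/LATE relationship can be made concrete with the textbook Wald estimator: under the standard IV assumptions, the LATE equals the ITT effect divided by the difference in take-up between the assigned and unassigned arms. A hypothetical illustration (none of these numbers come from the study, precisely because no usage data were collected):

```python
# Wald estimator: LATE = ITT / (take-up difference between arms).
# All numbers below are hypothetical -- the study has no usage data.

itt = 2.0            # ITT: +2 percentage points from assignment alone
usage_treated = 0.6  # assumed share of students using guides in assigned schools
usage_control = 0.1  # assumed share using (e.g. online) guides elsewhere

compliance_diff = usage_treated - usage_control  # share of 'compliers'
late = itt / compliance_diff
print(f"Implied LATE among compliers: {late:.1f} percentage points")
```

The estimated LATE applies only to the compliers in this sample; since the complier share could easily differ in other provinces or at national scale, a LATE is a weak basis for the kind of scaled-up cost-benefit claims the paper makes.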
Furthermore, the authors state that “one can confidently rule out” the possibility that the observed treatment effect is due to ‘experimenter effects’ (Hawthorne and John Henry effects). However, they seem not to appreciate that any effect unrelated to actual study guide usage would fall into that category. For example: students might work harder simply because they feel privileged relative to a neighbouring school that did not receive the guides. The authors therefore cannot, in my view, rule out John Henry or Hawthorne effects, and this too is relevant to extrapolation.
Given all the issues above, I would be very concerned if this kind of approach to evaluation and policy formulation took hold in the South African education sector. While I appreciate the authors’ good intentions — to contribute evidence to improve basic education in South Africa — I don’t think the paper actually adds much to our knowledge base. RCTs may have some uses, but they continue to be widely oversold.
There are various other objections from structural econometricians to the standard approach to RCTs, but it is not necessary to discuss those here.
I say ‘inappropriate data mining’ because data mining is increasingly recognised by econometricians as not being a problem in itself, but rather as a problem if not done according to well-founded principles and its results assessed accordingly. This issue has also been one motivation for pre-analysis plans; more on those in later posts.
The authors do not reference any sources for their discussions of external validity. Given that the paper emphasises John Henry and Hawthorne effects, I suspect their preferred source of information on external validity is probably the ‘toolkit’ provided by Duflo, Glennerster and Kremer (2006). Having said that: I have observed a disturbing trend of some academic peers not citing critical methodological work that is inconvenient – more on that in future posts.
Structural econometricians would argue that one should also think about Marginal Treatment Effects – see the work by Heckman and Vytlacil – but that’s also a subject for later posts.
This work is written in my personal capacity and licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.