‘Evidence’ and the EEF toolkit: Reliable science or a blunt set of tools?

Terry Wrigley

‘Evidence-based teaching’ is one of those feelgood phrases (think ‘school improvement’, ‘leadership’, ‘standards’) that it seems churlish to oppose. All the more important, then, to consider what such phrases signify and entail in the present context.

The concept and its practice are entangled with the status of numbers as authoritative carriers of fact, which in turn underpins the audit culture that dominates public service management, along with the ‘governance by numbers’ that affects policy, school evaluation, assessment for accountability, and curriculum choices.

It would be foolish for educators to reject evidence out of hand, as if tradition or instinct were enough, but what now stands proxy for the breadth of evidence is statistical averaging. This abstraction neglects practitioners’ accumulated experience, students’ needs and wishes, feedback, and an understanding of social context. We are facing a serious attempt to replace a rich array of research by the ‘gold standard’ of randomised controlled trials and their statistical synthesis, culminating in the Education Endowment Foundation (EEF) Teaching and learning toolkit if not the hubris of Hattie’s Visible Learning (2009).

This issue is not simply a disagreement among academics: this agenda has received polemical advocacy from Tom Bennett of ResearchED, and strident support from schools minister Nick Gibb. While Gibb explicitly sweeps aside evidence that is inconvenient (on school uniforms, for example), Bennett, in Teacher Proof (2013), rejects any other research as unscientific and worthless. He seems to believe that natural science operates without theory, and that the closed systems of experiments are a mirror of the complex openness of natural – let alone social – situations.

My article in the British Educational Research Journal (Wrigley 2018) looks at the evidence movement’s research methodology and assertions of scientific reliability, drawing on critical realist theory from Bhaskar and Sayer.

There are serious obstacles to setting up a rigorously controlled experiment in our field. Firstly, ‘double blinding’ is almost impossible: indeed, pedagogical innovation simply will not work unless teachers engage enthusiastically and students are persuaded of its benefit. Secondly, whereas drugs trials regard the human factor as a contaminator, social change is brought about through the human agent.

Thirdly, the notion of control group is problematic: does it involve doing nothing, or simply doing nothing special? The usual decision is ‘business as usual’, but what that means will vary from room to room. As Pawson (2006: 51) argues, the control is ‘not a piece of apparatus at idle. This is not the world in repose… Control groups or control areas are in fact kept very busy.’

Finally, RCT outcomes have to be easy to measure. There is a consequent danger of neglecting more challenging aims, or judging them by more basic proxies. RCTs generally do better with basic arithmetic than creative arts, design or citizenship.

Whereas extensive theorisation precedes every medical trial, the designers of education RCTs seem to assume that the data speaks for itself. Typically, a recent EEF trial of a phonics-based remedial programme did not even ask what precisely were the difficulties and barriers for these learners. Simply, the script was followed, the outcomes measured and relative benefit calculated.

Current attempts to evaluate teaching methods ‘rigorously’ simply don’t live up to the complexity of school situations. We lack a convincing model that brings together context, causality and interaction, and considers whether there is a stratified relationship between different forces and tendencies.

From this disturbing foundation, my article then examines problems in combining individual trials through ‘meta-analysis’ (more appropriately, ‘statistical synthesis’). Adrian Simpson’s (2017) mathematical scrutiny of the EEF toolkit was invaluable here, especially for pointing to the fundamental error in comparing effect sizes.

One of the toolkit’s central flaws, inherent in the contract between the EEF and its designers at CEM, is commonly known as ‘apples and oranges’. A good explanation can be found in a paper by Robert Coe (2002: 10) published some years before he became involved with the EEF project:

‘Given two (or more) numbers, one can always calculate an average. However, if they are effect sizes from experiments that differ significantly in terms of the outcome measures used, then the result may be totally meaningless.’

He explains that the effect sizes must relate to the same outcomes and similar treatments and populations. Without this ‘it makes no sense to average out their effects’ (ibid). It is troubling that the current political regime is embroiling experienced researchers in compromised and indefensible projects.
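Coe’s point about averaging can be illustrated with a small numerical sketch. The figures below are invented purely for illustration and are not taken from any trial or from the toolkit; the function name `cohens_d` and the two ‘trials’ are hypothetical. The sketch assumes the common definition of a standardised effect size as the raw difference in means divided by a pooled standard deviation.

```python
from statistics import mean


def cohens_d(mean_treated, mean_control, pooled_sd):
    """Standardised mean difference: raw gain divided by the pooled SD."""
    return (mean_treated - mean_control) / pooled_sd


# Hypothetical trial A: a narrow test on a homogeneous sample (small spread).
d_a = cohens_d(mean_treated=55.0, mean_control=50.0, pooled_sd=5.0)

# Hypothetical trial B: a broad test on a mixed sample (large spread).
# The raw gain is the same 5 points as in trial A.
d_b = cohens_d(mean_treated=55.0, mean_control=50.0, pooled_sd=20.0)

# The arithmetic mean can always be computed, whatever the two numbers measure.
average_d = mean([d_a, d_b])

print(d_a, d_b, average_d)  # prints: 1.0 0.25 0.625
```

In this invented case the same five-point gain yields effect sizes of 1.0 and 0.25 solely because the outcome measures differ in spread. The average of 0.625 describes neither trial, which is the sense in which such a figure ‘may be totally meaningless’.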

Terry Wrigley’s original article in the British Educational Research Journal (BERJ), ‘The power of ‘evidence’: Reliable science or a set of blunt tools?’, can now be read freely online until 31 July on Wiley Online. 


Bennett T (2013) Teacher Proof: Why research in education doesn’t always mean what it claims, and what you can do about it, London: Routledge

Coe R (2002) ‘It’s the Effect Size, Stupid: What effect size is and why it is important’, paper presented at the British Educational Research Association annual conference, Exeter, 12–14 September 2002.

Hattie J (2009) Visible Learning: A synthesis of over 800 meta-analyses relating to achievement, London: Routledge

Pawson R (2006) Evidence-based policy: A realist perspective, London: SAGE

Simpson A (2017) ‘The misdirection of public policy: Comparing and combining standardised effect sizes’, Journal of Education Policy 32(4): 450–466

Wrigley T (2018) ‘The power of “evidence”: Reliable science or a set of blunt tools?’, British Educational Research Journal 44(3): 359–376.