The UK’s A Level results were unusually disastrous this year, throwing into doubt our children’s futures as well as university admissions processes. What went wrong?
Discussion focused on the application of a seemingly new A Level algorithm to adjust grades according to a school’s prior performance, leading to outperforming students at traditionally underperforming schools being downgraded. Analysis revealed that the A Level algorithm also tended to benefit students at private schools, leading to accusations of social class bias.
We’d argue that the problem lay, not in the mathematics or coding of the A Level algorithm, but in the assumptions that drove its formulation. And where do those assumptions come from? They come from humans.
What is an algorithm?
An algorithm is a finite sequence of well-defined, computer-implementable instructions, typically used to solve a class of problems or to perform a computation. Algorithms serve as specifications for performing calculations, data processing, automated reasoning and other tasks. ‘If…then’ logic captures much of how they work.
A simple example is finding the largest number in an unordered list. Solving it requires looking at every number in the list, which gives us the following algorithm:
- If there are no numbers in the set, then there is no highest number.
- Assume the first number in the set is the largest number in the set.
- For each remaining number in the set: if this number is larger than the current largest number, consider this number to be the largest number in the set.
- When there are no numbers left in the set to iterate over, consider the current largest number to be the largest number of the set.
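As a minimal sketch, those same steps can be written in a few lines of Python. The function name and the example list are ours, purely for illustration:

```python
def largest(numbers):
    """Return the largest number in a list, or None if the list is empty."""
    if not numbers:                  # no numbers: there is no highest number
        return None
    largest_so_far = numbers[0]      # assume the first number is the largest
    for n in numbers[1:]:            # for each remaining number...
        if n > largest_so_far:       # ...if it is larger, it becomes the current largest
            largest_so_far = n
    return largest_so_far            # nothing left to check: this is the largest number

print(largest([3, 41, 7, 2, 19]))    # prints 41
```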
The logic is unambiguous, and the outcome reflects that logic. It short-circuits human error, and for a large dataset it performs tasks beyond human capacity.
The concept of an algorithm has existed since antiquity. Arithmetic algorithms, such as the division algorithm, were used by ancient Babylonian mathematicians. Arabic mathematicians such as al-Kindi in the ninth century used cryptographic algorithms for code-breaking.
With the Middle East generating much of our mathematical expertise, the word algorithm itself is derived from the name of the ninth-century mathematician Muḥammad ibn Mūsā al-Khwārizmī, Latinised as Algoritmi.
How are algorithms used?
We’re not going to talk in detail about this. Algorithm writing and testing is a huge endeavour beyond our scope.
But in practice, the value of an algorithm is twofold. First, it tells us the logical result of applying our assumptions. This often yields unexpected results, because our previous experience or bias has been leading us in another direction; the logic applied to our datasets shows us the logical outcome. Second, if the result is unacceptable, it shows us that the assumptions we have applied to our datasets are themselves wrong, incomplete or unacceptable, and need to be revised.
So what happened with A Levels?
There were three datasets used to recalculate the 2020 A Level results:
- The exam results of students who took the same subject at the same school in 2017, 2018 and 2019.
- Prior attainment data on students at the same school in 2017, 2018 and 2019.
- Prior attainment data on this year’s students.
The outcome of this statistical model was placed on top of the rank order which teachers submitted to exam boards. When drawing up the rank order, teachers ordered students from best to worst for each subject.
Teachers were also asked to come up with a predicted grade for each student, based on their mock exams, tests, coursework and homework.
For new schools which did not have historic data, as well as small schools or those in which low numbers of students were taking particular subjects, teachers’ predictions were to be the “primary source” of evidence for their grades this summer.
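To make the idea concrete, here is a deliberately over-simplified, hypothetical sketch of this kind of standardisation. It is not Ofqual’s actual model; the function, the student names and the toy grade distribution are all ours. It simply maps a teacher-submitted rank order onto a school’s historic grade distribution, which illustrates why a strong student at a historically weak school could be pushed down:

```python
def standardise(rank_order, historic_distribution):
    """Assign grades to students in rank order according to a school's
    historic grade distribution (hypothetical, highly simplified sketch)."""
    n = len(rank_order)
    grades = {}
    position = 0
    for grade, share in historic_distribution:   # e.g. ("A", 0.25) = 25% of the cohort
        count = round(share * n)
        for student in rank_order[position:position + count]:
            grades[student] = grade
        position += count
    # any students left over from rounding get the lowest grade in the distribution
    lowest_grade = historic_distribution[-1][0]
    for student in rank_order[position:]:
        grades[student] = lowest_grade
    return grades

# Teacher rank order (best to worst) mapped onto last year's grade shares
print(standardise(
    ["Asha", "Ben", "Cara", "Dev"],
    [("A", 0.25), ("B", 0.5), ("C", 0.25)],
))
# {'Asha': 'A', 'Ben': 'B', 'Cara': 'B', 'Dev': 'C'}
```

However well Dev performs, under a scheme like this no more than a quarter of the cohort can receive an A, because that is what the school achieved in previous years.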
The immediate reaction across the media, parents and students was that the results were unfair, disproportionately favouring private schools over state schools, and historically high-performing schools over underperforming ones. High-achieving students – the ones who beat expectations – stood to lose their places at university.
The policy process
Ofqual published the algorithm used to calculate A level results only when the results themselves were released. While reasons were put forward for keeping it confidential until then, the secrecy deprived Ofqual and the Department for Education of the opportunity to have their process critically tested. Ministers do not typically welcome critical testing of their process – but they do tend to find it preferable to critical examination of their decisions.
The DfE’s guidance, or instruction, to Ofqual was to ensure that there would be no material variation from previous years’ A level results. A significant number of teachers overpredict outcomes: in previous years, up to half of A level students have missed their predicted grades. Moving to ‘just’ predictions, the thinking went, would cause the 2020 results to be seen by universities and employers as devalued.
In 2020 the driving assumption that results should be comparable and not significantly divergent from those of previous years forced a series of unwelcome and ultimately unacceptable outcomes.
Algorithm injustices
Algorithms tend to have persistent problems with injustices, such as gender or racial biases. With the A Level algorithm, the problems this system failed to address were:
- It was a mixed system – some grades were teacher predicted; many more were not. That is always likely to be inconsistent and unfair.
- Appeal bias – if as a school head you thought 2020 was a weaker cohort than 2019, your motivation to assert unfairness is weakened. Your relief is private. In the converse case, you are motivated to protest unfairness, and publicly.
- Small class sizes are more common in the private sector, so relying on teacher predictions for small cohorts biased outcomes in the private sector’s favour.
- Teacher predictions in the state sector tend to be more optimistic. Correcting for that disproportionately reduces the grades awarded to pupils in state schools, and leaves no room for bright students whose performance exceeds expectations.
What can we learn from this?
Here at Aptem we are very keen on machine learning. We believe it adds value and insight to everything training companies do. But such technologies are in their early days, and each incident like the A Level results gives us an opportunity to learn. So here is what we think we have learned.
First, always trial the outcomes. It is better to trial process outcomes in a smaller experimental setting, and adjust assumptions accordingly, than to go live and discover the shortcomings.
Second, do not be driven by one assumption. The 2020 system was devised above all to keep grades comparable with 2019, and that priority turned out to carry little public weight once the resulting shortcomings became clear.
Third, test the thinking widely. Not publishing the thinking behind the algorithm lost the opportunity to gain valuable feedback on its weaknesses before it was applied for real.
Fourth, if you have to carry the audience, get them with you stage by stage. The backlash against the 2020 process led to teacher predictions being used after all. In the time available, and in view of the public pressure, no other outcome was possible. But now the DfE has the outcome it wanted to avoid. The opportunity to make comparability part of the process was lost through a lack of consultation.
Science is our servant
The technology used for the 2020 A levels is invaluable. But it does not remove the need for human correction and input.
Some systems are genuinely entirely driven by logic and accepted as such. We do not feel a need manually to moderate lottery draws, for example.
Many more need our sense check. We all do algorithmic “if…then” calculations constantly. We need to have the confidence to test what is presented to us as science; to be secure in the knowledge that it works for us as opposed to confounding us. Science is our servant, not our ruler.