Settling the Moffat vs. RTD Debate (Using AI Scores)
Guest contributor Joshua Yetman examines the AI scores and finds some surprising results.
The AI scores – what are they, what do they tell us, and are they important?
You may have seen them used in arguments, seen them quoted in a news article, or you may never have heard of them at all. The Audience Appreciation Index – whose results are simply referred to as AI scores – is the BBC’s primary way of gauging the response to each and every one of its major programmes, and it has been used extensively since the early days of television in the UK.
I have written this article to explain how AI scores work, outline their significance to Doctor Who, and quote some very interesting results; then, using statistics, I will hopefully settle a long-standing argument.
How does it work?
It’s simple really. Under the current methodology (in place since 2012), over 20,000 UK citizens with a TV licence are demographically selected, asked to watch a range of programmes, and then asked to give each programme they viewed a score out of 10. The scores are collected online (currently via GfK UK). The AI score is then calculated by taking the mean of all the scores in the sample and multiplying it by 10, expressing it as a score out of 100.
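To make the arithmetic concrete, here’s a minimal sketch of that calculation in Python, assuming the method described above (the mean of the 0–10 ratings, rescaled to a mark out of 100). The panel ratings below are made up purely for illustration.

```python
# A minimal sketch of the AI calculation described above: each panellist
# rates a programme out of 10, and the mean is rescaled to a score out
# of 100. The ratings here are invented for illustration only.
def ai_score(ratings):
    """Mean of 0-10 ratings, expressed as a score out of 100."""
    return round(sum(ratings) / len(ratings) * 10, 1)

panel = [9, 8, 8, 10, 7, 9, 8, 9]  # hypothetical panel ratings out of 10
print(ai_score(panel))             # -> 85.0
```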
Due to the large sample size, the AI score is seen as a very good estimate of a programme’s true reception, and so BBC bosses pay very close attention to the AI scores of their TV shows.
The Audience Appreciation Index has been operating in some form since 1936, and Doctor Who, being one of the Beeb’s longest-running shows, has an extensive catalogue of AI scores (although, much like the episodes themselves, quite a few classic-series AI scores are missing).
However, the entirety of the classic series of Doctor Who was subject to a rather inept Audience Appreciation Index, which operated on only a 6-point scale. This led to rather questionable scores for episodes that are generally considered by fans to be the best of the best. Take “The Caves of Androzani”, for example, one of the best-loved stories in the show’s history: it received a seemingly lukewarm score of 66/100. Of course, one could argue that the popularity of “The Caves of Androzani”, like that of many classic stories, grew over time, but 66/100 still seems very low.
What does the score mean?
In its current format, an AI score above 85/100 indicates excellence, above 90/100 is exceptional, and below 60/100 is poor (and, if it’s below 55/100, it’s very poor). Programmes very rarely make it above 90/100, but the only programmes that ever seem to make it below 55/100 are – surprise, surprise – party political broadcasts (a recent one by UKIP got a record low score of 21/100).
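As a quick sketch, those cut-offs translate into code like this. Note that the label for the broad middle band is my own invention; only the outer bands are named above.

```python
# The banding described above, as a simple lookup. The "typical" label
# is an assumption; the text only names the outer bands.
def ai_band(score):
    if score > 90:
        return "exceptional"
    if score > 85:
        return "excellent"
    if score < 55:
        return "very poor"
    if score < 60:
        return "poor"
    return "typical"  # unnamed middle band (assumed label)

for s in (91, 86, 76, 57, 21):
    print(s, ai_band(s))
```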
As the BBC is effectively a publicly funded body, it strongly desires its programming to be of the highest quality, more so than most broadcasters, who tend to care more about the size of their audiences. Of course, the BBC still cares about the size of its audience, and that is still the deciding factor in whether or not to axe a show, but the AI score is always a significant consideration.
Fortunately for Doctor Who, the Beeb have nothing to worry about. The average AI score since the show was revived in 2005 is 85.75, indicating that, on average, Doctor Who achieves excellence by BBC standards. In fact, of Doctor Who’s 104 episodes since 2005, 79 are considered at least excellent by this measure!
To put this score into perspective, the average BBC TV AI score is 82.3, so Doctor Who is well above average relative to other shows in the BBC’s creative arsenal.
Records of the revived era
The highest ever AI score achieved by an episode in the revived era is 91/100, achieved by “The Stolen Earth” and “Journey’s End”, the two parts of the Series 4 finale. These are the only two episodes in Doctor Who history to be considered exceptional by BBC standards. Make of that what you will.
Conversely, the lowest ever AI score achieved by an episode in the revived era is 76/100, achieved by – perhaps not surprisingly – “Love and Monsters”. Still, for all the hate that “Love and Monsters” gets, 76/100 is actually quite a decent score! Sure, it’s well below the BBC average, but if I got 76/100 in a test, I’d sure be pleased!
Considering each series, Series 4 has the highest average AI score with a respectable 87.8/100, followed closely by Series 3, whilst Series 1 has the lowest average AI score with a still admirable 82.8/100, with Series 2 not far behind. Series 5, 6, and 7 take the middle positions, in that order from best to worst.
Some seemingly abnormal results
Some AI scores seem to go against what we would expect. I’ll highlight a few examples here.
“The Curse of the Black Spot” received an AI score of 86/100, a score shared by “Human Nature”, “The Family of Blood”, “Midnight” and “The Eleventh Hour”, among many others. For an episode so derided by fans, and considered the second worst episode of the whole Matt Smith era according to a poll on this very website, it still managed an “excellent” rating.
“Let’s Kill Hitler”, another typically scorned episode (although it remains one of my all time favourites), received a score of 85/100, putting it on par with “The Doctor Dances”, “The Impossible Planet”, and “The Girl Who Waited” among many other episodes, and also giving it “excellent” status.
Finally, the score of “Father’s Day” – a highly praised episode – was only 83/100, although that was still one of the highest in Series 1.
So, are they reliable? Are they useful? Are they important?
Ultimately, we have to remember that AI scores are just estimates. Nevertheless, they are what we in the statistics world call unbiased estimators, in the sense that the sample is chosen to be representative of the population; given an ample sample size (which 20,000 certainly is), they should be very close to the true average score that the entire population would give. That makes them very useful.
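To illustrate the point, here is a toy simulation under entirely assumed numbers: suppose the whole UK audience’s “true” ratings of an episode average around 8.6/10. Repeatedly drawing samples of 20,000 viewers shows how tightly a sample-based AI score clusters around the true value.

```python
# A toy simulation of sampling error. The "population" below is entirely
# made up (Gaussian ratings around 8.6/10, clipped to the 0-10 scale);
# it only illustrates how stable a 20,000-person sample mean is.
import random

random.seed(1)
population = [min(10, max(0, random.gauss(8.6, 1.2))) for _ in range(1_000_000)]
true_ai = sum(population) / len(population) * 10

for _ in range(5):
    sample = random.sample(population, 20_000)
    est_ai = sum(sample) / len(sample) * 10
    print(f"true AI = {true_ai:.2f}, sample estimate = {est_ai:.2f}")
```

Every estimate lands within a fraction of a point of the truth, which is exactly why a panel of this size can be trusted.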
But then how can we explain the abnormal results I highlighted previously? Possibly, those included in the sample for episodes like “The Curse of the Black Spot” enjoyed it more than the overall population did, but given the size of the sample, this is actually unlikely. Maybe the wider UK audience genuinely enjoyed it more than us fans did, and we just don’t know about it!
The fact that episodes like “The Caves of Androzani” received such low AI scores seems, at face value, like another problem with AI scores, but we do have to consider that many episodes only truly become fan-favourites over time, meaning that AI scores – calculated straight after broadcast – are not representative of long-term reception. Also, “The Caves of Androzani” may have only received an AI of 66/100, but the BBC TV average back then was about 65/100. This stresses that the AI score should be used relative to other programmes or episodes rather than as an absolute measure of quality – but we shouldn’t compare classic series AI to revived era AI, as they are calculated differently!
Also, it’s important to say that, to us, these scores should mean nothing! We all have our own personal ratings, and no little statistic is going to influence our opinions in the slightest. However, the BBC doesn’t calculate the AI score for no reason. Like I said earlier, the BBC take AI scores very seriously, and although a large drop in ratings may well sound the death knell for shows like Doctor Who (which fortunately isn’t happening at the moment – in fact, the opposite!), the BBC will be very concerned if there is a large drop in AI as well. So, let’s hope those scores remain high!
Settling the Moffat vs. RTD argument
Statistics are powerful. With them, you can draw wonderful conclusions. I personally think the AI scores are very strong and accurate estimators, and so we can confidently put some hypotheses to the test.
By the AI scores, is there evidence to suggest that the Moffat era differs in quality from the RTD era? With some quick calculations, the average RTD era episode scores an 85.57/100, while the average Moffat era episode scores an 86.0/100. One is higher than the other but, remember, these are samples. Using what we in the statistics world call a two-sample hypothesis test, I can conclude, resolutely and finally, that the gap between those two means is nowhere near statistically significant: even at a lenient significance level, we cannot reject the hypothesis that the two eras are equally well received. In other words, the AI scores give no evidence of any difference in quality between the eras.
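For anyone who wants to try this at home, here is a minimal sketch of such a test in Python using SciPy. The episode-by-episode scores aren’t reproduced in this article, so the arrays below are hypothetical stand-ins; substitute the real per-episode AI scores to run the comparison properly.

```python
# A minimal sketch of a two-sample hypothesis test (Welch's t-test).
# The score lists are hypothetical stand-ins, NOT the real per-episode
# AI scores; swap in the real data to reproduce the comparison.
from scipy import stats

rtd_scores = [84, 86, 85, 89, 83, 87, 86, 85]     # placeholder RTD-era episode AIs
moffat_scores = [85, 88, 86, 84, 87, 86, 85, 87]  # placeholder Moffat-era episode AIs

t_stat, p_value = stats.ttest_ind(rtd_scores, moffat_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# A large p-value means the gap between the era means is easily explained
# by sampling chance, so we cannot reject "the eras are equally rated".
if p_value > 0.05:
    print("No significant difference between the eras.")
```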
So, hopefully, this should put an end to these mindless RTD vs. Moffat discussions, and if that isn’t the single biggest use to come out of the AI scores, I don’t know what is!