Wednesday, 7 July 2010

I'd like to be, under the sea...

An octopus named Paul has been making the news due to his alleged ability to correctly predict the winner of Germany's international football matches. It started with Euro 2008, where he supposedly called 4 of Germany's 6 games correctly. The BBC reported this as "nearly 70%", which is perhaps being a little generous, as a 70% success rate sounds rather more impressive than correctly making 4 out of 6 50/50 guesses.

For the World Cup Paul has (apparently) correctly picked the results of Germany's 5 games up until tonight's semi-final, where he has controversially chosen Spain to triumph. So is Paul a Predicting Phenomenon, or just lucky?

We'll start with his World Cup picks where he's got 5 out of 5 right (so far). Our null hypothesis is that Paul is merely picking at random, and since each pick is a 50/50 choice this is the same as saying the probability he picks correctly is 0.5. The probability of getting 5 correct selections is then the same as tossing a coin 5 times and getting 5 heads. This is easy to compute, as we just multiply the probabilities together to get 0.5 x 0.5 x 0.5 x 0.5 x 0.5 = (0.5)^5 = 1/32 or about 3%. That seems pretty unlikely (although not too astronomical).
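If you'd like to check the arithmetic yourself, the calculation is a one-liner (this is just an illustrative sketch, not part of the original workings):

```python
# Under the null hypothesis that Paul guesses at random, each pick is a
# fair coin toss, so 5 correct picks in a row has probability (0.5)^5.
p_five_in_a_row = 0.5 ** 5
print(p_five_in_a_row)  # 0.03125, i.e. 1/32 or about 3%
```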

Amusingly, were we performing a statistical hypothesis test, we would in fact be likely to say that the data are not consistent with the null hypothesis that Paul is picking at random. This is because the probability that he would have got all 5 predictions correct is less than 5%, the standard cut-off used in hypothesis testing (we would say "the data are significant at the 5% level"). Of course, this highlights the danger of the common practice of just looking at p-values (which is what our probability above is) and concluding that the null hypothesis must be true or false - it would take a rather stronger run of successes to convince most people that an octopus could really correctly predict football results. A 5% significance level means that even if our null hypothesis is true, an outcome will appear 'significant' (and we would question the null hypothesis) if the chances of it happening are less than 1 in 20. This really isn't that unlikely.

We do have more data, however, thanks to Paul's Euro 2008 picks. This takes his record to 9 correct out of 11 - is this statistically significant as well? Once again we want to calculate the probability that Paul would get this success rate picking at random, but it's slightly harder to work out this time. What we want to know is the probability that Paul would be at least this successful were he picking at random. So whereas before we just had to calculate the probability of 5 heads from 5 tosses, here we need to calculate the probability of 9 heads from 11 tosses, 10 heads from 11 tosses, and 11 heads from 11 tosses; adding these three probabilities up will tell us how 'lucky' Paul is.

So 11 heads from 11 tosses is easy, like the case with 5 out of 5, it's just 0.5 multiplied by itself 11 times. What about 10 heads, or 9? Things get a little trickier. Whilst there is only one way to get 11 heads from 11 tosses, there are several ways to get 9 heads. This might sound odd, but if you imagine tossing a coin twice, you can get either:

1) Tails followed by tails (TT)
2) Heads followed by heads (HH)
3) Tails followed by heads (TH)
4) Heads followed by tails (HT)

All of these outcomes are equally likely, but two of them (HT and TH) correspond to getting one head and one tail, and it's this which makes computing the probability of 9 heads from 11 tosses a bit tricky. Fortunately there's a simple formula for calculating this, known as the binomial coefficient. I'll spare you the details (since it's mathsy, and you can read Wikipedia if you like), and tell you how to use Google to get the number you want. Just type in "x choose y" and Google will tell you how many ways there are to get y heads from x coin tosses. Here, we want 11 choose 9, which gives us 55 ways to get 9 heads from 11 tosses. The probability of getting any one particular combination of 9 heads and 2 tails is just 0.5 multiplied by itself 11 times; once for each 50/50 coin toss. Since there are 55 different ways of doing this we then want 55 times this to allow for each possibility. So the final probability that one would get 9 heads from 11 tosses is 55 x (0.5)^11, about 2.7% or 1 in 37.
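If you'd rather not ask Google, Python's standard library can compute the binomial coefficient directly (a small illustrative sketch; `math.comb` is available from Python 3.8):

```python
import math

# "11 choose 9": the number of distinct ways to get 9 heads from 11 tosses.
ways = math.comb(11, 9)
print(ways)  # 55

# Each particular sequence of 11 tosses has probability (0.5)^11,
# so the probability of exactly 9 heads is 55 of those.
p_exactly_9 = ways * 0.5 ** 11
print(p_exactly_9)  # about 0.027, i.e. roughly 1 in 37
```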

Similarly, we see there are 11 ways to get 10 heads from 11 coin tosses, so the probability of exactly 10 heads is 11 x (0.5)^11, about 1 in 186.

We can now put these three probabilities together and add them up to give Paul's prediction p-value as (1 + 11 + 55) x (0.5)^11 = 3.3% or about 1 in 30.
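The whole sum can be checked in a couple of lines (again, just a sketch of the arithmetic above, not anything from Paul's tank):

```python
import math

# p-value: probability of getting at least 9 correct from 11 random
# 50/50 picks, i.e. exactly 9, 10 or 11 heads from 11 coin tosses.
p_value = sum(math.comb(11, k) for k in (9, 10, 11)) * 0.5 ** 11
print(p_value)  # (55 + 11 + 1) / 2048, about 0.033 or 1 in 30
```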

So even taking Paul's two mistakes into account, his four extra correct picks mean the chances of him managing his record at random have only increased marginally, and his punditry powers remain statistically significant.

So how important is the match tonight? He's selected Germany's opponents Spain to progress, and of course if he's right it will be further evidence that his powers are not merely down to chance, but what if he's wrong? It would take his World Cup prediction record to 5 right out of 6 - would this still be statistically significant? The probability of getting at least 5 out of 6 right is calculated the same way as our example above with 9 out of 11. The probability of 6 out of 6 is just (0.5)^6, and there are 6 ways of getting 5 heads from 6 tosses, so the probability of exactly 5 out of 6 is 6 x (0.5)^6. Adding these together we get a probability of about 11%, or 1 in 9. With just one wrong choice his picking would stop being statistically significant.

What about his lifetime record? That would go to 9 out of 12. I'll spare you the maths now and just tell you that the probability of getting at least 9 out of 12 right is 7.3%, or 1 in 14. Again, statisticians would stop heralding Paul as the mussel eating messiah.

So if he's wrong tonight he'll seem unremarkable (to p-value cultists, at least), whilst if he's right he'll be pushed further towards probabilistic stardom. This does of course demonstrate the dangers of trying to perform statistics entirely through p-values (which many practitioners do), and how sensitive they can be to even a single result one way or the other.

Now I'm off to watch the football where I'll be rooting for the Germans. If they win the World Cup it means England come joint second, right?


  1. There is of course a detail that you've (likely intentionally) overlooked, which is that the chance of winning a football match is not usually exactly 50-50. Given that Germany are one of the world's top teams they could be expected to win more matches than they lose.

    Having done a bit of research, I've discovered that since Paul started making his predictions (at least for the public) at the start of Euro 2008, Germany have won 22 games, lost only seven and drawn four. Their win record thus stands at 66.7% over the last two years, which is probably a fairer representation of their chances of victory in a randomly determined match.

    Can your analysis take account of this?

    Secondly, I would note that there are three possible results in most football matches (win, loss, draw) rather than two, although there seems to be no way for Paul to predict anything other than a win or loss. So far none of the matches he has made predictions for have resulted in a draw, but the possibility exists nonetheless. How does that affect the overall dataset?

  2. Good questions, I've put up a new post that hopefully addresses them.