I recently got hold of a rather fun dataset: every English Premier League football match since the league was established in 1992. With 22 teams for the first three seasons (a fact which in itself was news to me), and 20 from the 1995/1996 season onwards, I have data on 6,706 football matches up to the end of last season. Awesome.
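As a quick sanity check on that total, here is a small sketch of the arithmetic. It assumes the usual double round-robin format, where every team hosts every other team once. The function name is my own, and the number of 20-team seasons is inferred from the total rather than read from the data.

```python
def matches_per_season(teams: int) -> int:
    # Double round-robin: each team hosts every other team exactly once.
    return teams * (teams - 1)

total = 6706
early = 3 * matches_per_season(22)  # the three 22-team seasons
remaining = total - early

# How many full 20-team seasons account for the rest?
seasons_of_20, leftover = divmod(remaining, matches_per_season(20))

print(early, remaining, seasons_of_20, leftover)  # 1386 5320 14 0
```

The remainder of zero is reassuring: the total is exactly what three 22-team seasons plus a whole number of 20-team seasons would give.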
So, as with any new dataset, the first thing to do (after cleaning it up) is to think about what it is we are interested in, and go about some simple, exploratory analyses. I found myself with some free time this evening, so I have put together some fairly basic statistics, which I'll share here through the controversial (yes, really) yet commonly used pie chart. That's a subject I'll probably do a separate post about later.
My initial idea was to look at results, and in particular the impact of home advantage. Everyone knows that playing at home is supposed to give your team a boost, be it through familiarity with the ground, less travelling, more fans, knowing where the booby traps are, and so on. Indeed, this is ground well-trodden by sports statisticians. For instance, the guys behind Fink Tank assign a 0.5 goal advantage to the home team.
First up then, let's retread old ground and look just at match outcomes. Out of all 6,706 games in the dataset, how many times does the home team win, how many times does the away team win, and how many times is it a draw?
To anyone with a passing interest in football, this won't be a particularly surprising chart. 46% (nearly half) of all games are won by the home team, with draws and away wins being about as likely as each other. This distribution of match results (or at least, a very similar one) has been used by the guys at the Winton Programme for the Public Understanding of Risk to look at to what extent the final standings in the Premier League table represent skill, and to what extent luck. If you're interested (and don't mind a little bit of maths), their work is well worth a read.
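For anyone wanting to reproduce this sort of tally, here is a minimal sketch. It assumes each match is stored as a (home goals, away goals) pair. The function names and the four toy scores are made up for illustration and are not taken from the real dataset.

```python
from collections import Counter

def outcome(home_goals: int, away_goals: int) -> str:
    # Classify a single match from the final score.
    if home_goals > away_goals:
        return "home win"
    if home_goals < away_goals:
        return "away win"
    return "draw"

def outcome_shares(scores):
    # Count each outcome, then convert the counts to proportions.
    counts = Counter(outcome(h, a) for h, a in scores)
    n = sum(counts.values())
    return {k: counts[k] / n for k in ("home win", "draw", "away win")}

# Made-up scores, purely to show the shape of the input.
toy_scores = [(2, 1), (0, 0), (1, 3), (1, 0)]
print(outcome_shares(toy_scores))
# {'home win': 0.5, 'draw': 0.25, 'away win': 0.25}
```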
So what else can we do to explore the data? How about looking at what happens in each half of a game? The next two figures summarise the outcome of each half of a match, regardless of the overall match result.
Looking at each half individually, we see that the overall match outcome isn't obviously reflected. A draw suddenly seems much more likely, with 43% of first halves and 37% of second halves being tied. When you consider that this turns into just 27% of matches being drawn overall, it seems a little surprising.
The other thing we notice is that the second half is drawn considerably less often than the first; both teams are more likely to win the second half than the first half. Our exploratory analysis has thrown up our first potentially interesting question: why do second halves result in fewer draws? It could be the result of a greater drive to win a game, an inspirational substitution or just teams getting tired (and so more sloppy) as the game wears on. We'll come back to this another time; for now we should do a bit more digging to see if any other interesting finds show up.
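To be clear about what "the outcome of a half" means here, the sketch below treats the second half as its own mini-match, using only the goals scored after the break. It assumes each record holds the half-time and full-time scores; the function names and example scores are hypothetical.

```python
def sign(x: int) -> int:
    # 1 if positive, -1 if negative, 0 if zero.
    return (x > 0) - (x < 0)

LABELS = {1: "home", 0: "draw", -1: "away"}

def half_outcomes(ht_home, ht_away, ft_home, ft_away):
    # Margin in each half, from the home team's point of view.
    first_margin = ht_home - ht_away
    second_margin = (ft_home - ht_home) - (ft_away - ht_away)
    # The match margin is just the two half margins added together.
    match_margin = first_margin + second_margin
    return (LABELS[sign(first_margin)],
            LABELS[sign(second_margin)],
            LABELS[sign(match_margin)])

# Goalless first half, home team scores after the break.
print(half_outcomes(0, 0, 1, 0))  # ('draw', 'home', 'home')
# Home team leads 1-0 at the break, away team equalises.
print(half_outcomes(1, 0, 1, 1))  # ('home', 'away', 'draw')
```

Seen this way, a match ends level only when the two half margins cancel exactly, so a drawn half on its own says nothing certain about the final result.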
In the three figures on the left, I've looked at the final result of matches, depending on who (if anyone) is winning at half time. Firstly, it is unsurprising to see that if the away team is winning at half time, they're odds-on to go on and win the match outright, doing so 67% of the time. However, all is not lost if your team goes in losing at home at half time: they still have almost a 1 in 3 chance of salvaging something from the game, which seems pretty good consolation.
A draw at half time, meanwhile, suggests a draw at the final whistle is the most likely of the three possible outcomes, happening nearly 40% of the time. Again, though, we see the home advantage being demonstrated through the statistics. With the match drawn at half time, the away team goes on to win 1 in 4 times; a home win is much more likely.
Finally, we look at the opposite of our first chart, and see that if the home team is winning at half time they do a much better job of holding onto that lead. Whilst an away team leading at half time would go on to win two-thirds of such matches, a home team in a similar position goes on to win 80% (four-fifths) of the time. The contrast is even more stark when we look at the probability of a comeback. A home team losing at half time will come back and win 10% of the time, so once every ten games. On the other hand, an away team losing at half time will manage to win only 5% of such games, just once in twenty.
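These three charts are really one table: the share of each full-time result, given the state of play at half time. A rough sketch of how such a table could be built is below. The record layout (half-time and full-time scores per match), the function names and the eight toy matches are all my own assumptions for illustration.

```python
from collections import Counter, defaultdict

def leader(home: int, away: int) -> str:
    # Who is ahead on a given score: "home", "away" or "draw".
    if home > away:
        return "home"
    if home < away:
        return "away"
    return "draw"

def full_time_given_half_time(matches):
    # Group matches by half-time state, then tally the full-time results.
    table = defaultdict(Counter)
    for ht_home, ht_away, ft_home, ft_away in matches:
        table[leader(ht_home, ht_away)][leader(ft_home, ft_away)] += 1
    # Turn each group's counts into proportions of that group.
    return {
        ht: {ft: c / sum(counts.values()) for ft, c in counts.items()}
        for ht, counts in table.items()
    }

# Made-up matches: (ht_home, ht_away, ft_home, ft_away).
toy_matches = [
    (1, 0, 2, 0), (1, 0, 1, 0), (2, 1, 3, 1), (1, 0, 1, 1),  # home ahead at HT
    (0, 1, 0, 2), (0, 1, 1, 1),                              # away ahead at HT
    (0, 0, 0, 0), (0, 0, 1, 0),                              # level at HT
]
shares = full_time_given_half_time(toy_matches)
print(shares["home"])  # {'home': 0.75, 'draw': 0.25}
```

Each row of the result corresponds to one pie chart, so the proportions within a row always add up to one.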
I'll finish off (after that pie chart overload) with a bit of fun (for some value of 'fun'). The following is a bar chart showing the goal difference at the end of a match from the home team's perspective (so +1 means the home team won by one goal, -2 means the home team lost by two goals). Notice how when we look at the data this way, one might mistakenly conclude that a draw is the most likely outcome, as the highest bar is for a goal difference of zero. The real story, of course, requires us to consider all of the bars, and one can easily see how this graph has been produced from the same data that told us 46% of home teams win.
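The point about adding up the bars can be sketched in a few lines. The ten scores below are invented so that zero is the tallest single bar while home wins are still the most common outcome; the variable names are mine too.

```python
from collections import Counter

# Made-up final scores as (home goals, away goals).
toy_scores = [(0, 0), (1, 1), (2, 2), (1, 0), (2, 1),
              (2, 0), (3, 0), (0, 1), (1, 2), (0, 2)]

# One bar per goal difference, from the home team's perspective.
bars = Counter(h - a for h, a in toy_scores)

tallest_bar = bars.most_common(1)[0][0]
home_wins = sum(n for diff, n in bars.items() if diff > 0)
draws = bars[0]
away_wins = sum(n for diff, n in bars.items() if diff < 0)

print(tallest_bar)                  # 0
print(home_wins, draws, away_wins)  # 4 3 3
```

The zero bar is the tallest on its own, but the home-win bars (+1, +2, +3) together outweigh it, which is the same effect as in the real chart.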