I just had a quick look at the times goals were scored in the Premier League dataset:
If you'll excuse the somewhat poorly labelled x-axis (R is being fussy, and I'm not particularly inclined to try and fix it for something so trivial), there are a few interesting points.
Most obvious are the two huge bars at around half time and full time. As you might have already guessed, this is due to how goal times have been reported - goals in injury time in the first half are reported as a 45th minute goal, and in the second half as a 90th minute goal. The fact that the 90th minute bar is so much taller than the 45th minute one demonstrates something most of us have probably observed: second half injury time is almost always longer than first half injury time.
If we filter out these two anomalies, and look at the goal times without the 46th or 91st minute goals, we get the following:
A couple of things to notice here. The first is that goals in the first minute really do seem quite rare, occurring in just 82 games (once in over 200 games). The second is that there looks to be a slight pattern to goal times - it seems goals become a bit more likely as the match wears on. Is this the case, or are our eyes just deceiving us?
Fortunately, we can test this hypothesis using a statistical technique called linear modelling. What this means is that we assume that the number of goals scored in a particular minute has a linear relationship with the time at which they're scored. In other words, if we just plotted the above bar graph but with points instead of bars, the points would lie on a roughly straight line. In fact, let's do this and see how it looks. One thing we'll change is to switch from goal frequency to percentage. The data are the same, but it will make more sense to talk about percentages later on.
It looks a lot clearer when we plot the data this way, with most of the points seeming to lie (very roughly) on a straight line. We also note that each minute seems responsible for around 1% of goals. With 90 minutes in a game we'd expect something like this, so we've provided ourselves with a useful 'sanity check' - never a bad thing when playing with data.
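For anyone who wants to reproduce the filtering and the switch from frequency to percentage, here is a minimal Python sketch (the plots in this post were made in R, so this is not the code I actually ran). The function name `goal_percentages`, the toy list of goal minutes and the choice of which minutes to exclude are all assumptions made up for illustration; the excluded minutes are left as a parameter because they depend on how your dataset codes injury time.

```python
from collections import Counter


def goal_percentages(goal_minutes, excluded_minutes):
    """Return {minute: percentage of the remaining goals} after dropping excluded minutes."""
    kept = [m for m in goal_minutes if m not in excluded_minutes]
    if not kept:
        raise ValueError("no goals left after filtering")
    counts = Counter(kept)
    total = len(kept)
    # Percentages are taken over the goals that survive the filter, so they sum to 100.
    return {minute: 100.0 * n / total for minute, n in sorted(counts.items())}


# Invented toy data, purely for illustration (NOT taken from the Premier League dataset).
toy_minutes = [3, 10, 10, 45, 45, 45, 60, 77, 77, 77, 90, 90, 90, 90]
toy_excluded = {45, 90}  # hypothetical coding of the injury-time bars

percentages = goal_percentages(toy_minutes, toy_excluded)
for minute, pct in percentages.items():
    print(f"minute {minute}: {pct:.1f}% of goals")
```

Because the percentages always sum to 100, spreading them evenly over 90 minutes gives 100 / 90, or about 1.11% per minute, which is the sanity check mentioned above.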
Having observed this pattern, can we make any use of it? It might be nice to be able to fit a model that could tell us how likely a goal in, say, the 10th minute of a match would be, if it's true that there is a simple, straight line relationship between time in the game and likelihood of a goal. One option would be to just put a ruler on the graph and try and draw a straight line that seemed to best fit the data (indeed, I can remember doing this when I was at school, back when I would call this a 'line of best fit'). Fortunately, we can use statistical software to do this for us, and in arguably a much more reliable way.
To put our model mathematically, suppose the percentage of goals scored in the Xth minute of the game is G, then we'd assume that G = a + bX, where a and b are some numbers we want to find out. Using statistical software we can fit this model through a method called least squares. What this method does is a bit like what you do when you put a ruler on the graph and try and move it around until it looks about right, with the same number of points above and below the line you want to draw. What your eye is doing when you do this is probably trying to minimise the total distance all of your points are from the line you're drawing; what least squares does is calculate the line that minimises the sum of the squares of these distances. In other words, if you imagine a line drawn on the graph, and measure the vertical distance from the line to each of the points, square these distances and add them all up, least squares will find you the line that makes this total the smallest.
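To make the idea concrete, here is a small Python sketch of least squares for a straight line, using the standard closed-form solution rather than a statistics package. The function names and the five toy points are invented for illustration; this is not the code or the data behind the numbers in this post.

```python
def fit_least_squares(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the fitted line passes through the means
    return a, b


def sum_squared_distances(xs, ys, a, b):
    """Total of the squared vertical distances from each point to the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Invented toy points, purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [1.0, 1.2, 0.9, 1.4, 1.5]

a, b = fit_least_squares(xs, ys)
best = sum_squared_distances(xs, ys, a, b)
nudged = sum_squared_distances(xs, ys, a + 0.05, b - 0.01)  # a slightly different line

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"fitted line total: {best:.4f}, nudged line total: {nudged:.4f}")
```

Moving the ruler even slightly away from the fitted line, as the nudged line does, always makes the total of squared distances larger.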
Applying this method we find that a = 0.895 and b = 0.005, which we can then plug back into our model equation: instead of G = a + bX we now get G = 0.895 + 0.005X. This tells us that for every extra minute in the game, the percentage of goals scored in that minute goes up by 0.005 percentage points. Admittedly, this isn't very much at all, but over the course of the game that works out to an increase of around 0.45 percentage points from start to finish, which, when you consider that the average minute will only have 1.11% of goals, seems a little more dramatic. For example, a goal in the 80th minute is around 40% more likely than a goal in the 10th.
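The arithmetic is easy to check by plugging minutes into the fitted equation. The sketch below uses the rounded coefficients quoted above, and the helper name `predicted_percentage` is just an illustrative choice. With these rounded values the 80th minute comes out about 37% more likely than the 10th, a touch under the figure quoted above; my assumption is that the gap is down to the rounding of a and b.

```python
A = 0.895  # intercept, as rounded in the text
B = 0.005  # slope, as rounded in the text


def predicted_percentage(minute, a=A, b=B):
    """Model's predicted percentage of goals scored in the given minute."""
    return a + b * minute


g10 = predicted_percentage(10)
g80 = predicted_percentage(80)

# How much more likely is an 80th minute goal than a 10th minute one, in relative terms?
relative_increase = (g80 / g10 - 1) * 100

# Change over the whole match, in percentage points.
start_to_finish = B * 90

print(f"10th minute: {g10:.3f}% of goals")
print(f"80th minute: {g80:.3f}% of goals")
print(f"relative increase: {relative_increase:.1f}%")
print(f"start to finish: {start_to_finish:.2f} percentage points")
```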
Finally, let's plot the line onto the previous graph, like so:
You might well think that it looks about as good as something you could have done by eye with a ruler, and you're probably right. However, another thing using a computational method can tell us is how 'significant' the effect of time on goal probability is. In other words, to what extent is there really an underlying relationship between the time in a game and the probability of a goal, and to what extent is this just the result of our data randomly falling into a pattern? We find, mostly thanks to the sheer size of our dataset, that this pattern is not likely to be down to chance; there really does seem to be a linear relationship between time in a game and the probability of a goal going in.
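One way to get a feel for what 'significant' means is a permutation test: shuffle the percentages so that any link with the minute is destroyed, refit the slope, and count how often chance alone produces a slope at least as steep as the one observed. To be clear, this is a swapped-in technique chosen because it needs nothing beyond Python's standard library; it is not the test my statistical software reported. The data below are invented (the rounded line from above plus made-up random noise), and the function names are hypothetical.

```python
import random


def slope(xs, ys):
    """Least squares slope of y against x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Estimate how often shuffled data give a slope at least as steep as the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    at_least_as_steep = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between minute and percentage
        if abs(slope(xs, shuffled)) >= observed:
            at_least_as_steep += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_steep + 1) / (n_shuffles + 1)


# Invented toy data: the rounded fitted line plus made-up noise (NOT the real dataset).
noise = random.Random(1)
minutes = list(range(1, 91))
toy_pct = [0.895 + 0.005 * m + noise.gauss(0, 0.1) for m in minutes]

p = permutation_p_value(minutes, toy_pct)
print(f"estimated p-value: {p:.4f}")
```

A small value here says that shuffled, patternless data almost never produce a slope as steep as the one in the toy data, which is the same kind of statement as the one made above about the real goal times.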
Perhaps something to bear in mind after a goalless first half.