Monday, 2 August 2010

A thorough investigation into popular opinion of statistics

(In the interests of full disclosure, I removed a bunch of results for things such as "9 out of 10 statistics are made up" which generated artificial hits for lower numbers (10 in this example).)

I think the graph speaks for itself: most people think most statistics are made up. A couple of bonus (not entirely made up) stats are that the modal hit was "87% of statistics are made up", and the mean hit was "74.8% of statistics are made up". So next time someone tells you an alarming statistic don't worry, three-quarters of the time it will have just been made up. Possibly.

Saturday, 10 July 2010

Mega Football Versus Small Octopus

On my previous post about Paul the octopus, a commenter asked a couple of questions which I thought merited a separate post to address. The first:

"There is of course a detail that you've (likely intentionally) overlooked, which is that the chance of winning a football match is not usually exactly 50-50. Given that Germany are one of the world's top teams they could be expected to win more matches than they lose.

Having done a bit of research, I've discovered that since Paul started making his predictions (at least for the public) at the start of Euro 2008, Germany have won 22 games, lost only seven and drawn four. Their win record thus stands at 66.7% over the last two years, which is probably a fairer representation of their chances of victory in a randomly determined match.

Can your analysis take account of this?"


This is an interesting question, and it boils down (as happens surprisingly often with probability) to a matter of perspective.

Suppose you have a friend called Peter who knows a bit about football. He's successfully predicted the results of the same six games that Paul has. Since Peter knows about football, he knows that the chance of Germany beating Australia (for example) was probably not exactly 50%. Does this matter?

Well, not really. The analysis we carried out last time was testing a specific hypothesis - that Paul was picking teams at random. This was our 'null' hypothesis, our default state of belief, if you will. Our 'alternative' hypothesis was that he has done better than you'd expect him to by chance alone. In testing this I claimed that Paul's chance of predicting the winner - if he's just picking at random - is 50/50. Crucially, this doesn't depend on the real chances of either outcome. This might not seem intuitive at first, but imagine Paul was picking the team after the game had happened - at this point the winner is known, so if he's picking at random he has a 50% chance of picking the right team. Since Paul's picking doesn't (we presume) interfere with the outcome of the game, if we're assuming he's picking 'blind' then it doesn't matter whether he chooses before or after the result is determined.
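This 'perspective' argument is easy to check with a quick simulation - a sketch only, ignoring draws, with the 66.7% win rate taken from the commenter's figures: however likely Germany are to win, a blind 50/50 pick is right about half the time.

```python
import random

random.seed(42)

def blind_pick_accuracy(p_win, n_games=100_000):
    """Simulate games where one team wins with probability p_win,
    and a 'Paul' who picks either team at random (50/50)."""
    correct = 0
    for _ in range(n_games):
        winner = "Germany" if random.random() < p_win else "Opponent"
        pick = random.choice(["Germany", "Opponent"])
        if pick == winner:
            correct += 1
    return correct / n_games

# Whether Germany win 50% or 66.7% of their games, a blind
# pick is correct about half the time.
print(blind_pick_accuracy(0.5))    # ~0.5
print(blind_pick_accuracy(0.667))  # ~0.5
```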

So what about Peter? We could test the same hypothesis and we would come to the same conclusion. The only difference is that we're not (as) impressed because we would expect him to be doing better than chance anyway - he has extra information to help make his decisions. Paul, meanwhile, is just an octopus, and so no-one would expect him to know anything (except possibly how to count to eight).

On a separate note, and with regards to the probabilities we've calculated telling us that Paul has some apparently incredible ability, it's worth stressing that that isn't what we've shown either. All we've done is show that if Paul was picking at random (as - call me a sceptic - he probably was) he's just got quite lucky. This in itself isn't really that remarkable though - Paul was only brought to our attention after a string of successful predictions. There may well have been hundreds of other octopuses/coins/babies making similar predictions and getting them wrong, and we've just got to see the one who got them right. If you see a golfer hit a hole in one it seems remarkably improbable, but if you think about all the millions of shots that didn't go in, that single event occurring doesn't seem so incredible.

But anyway, on to the second question:

"Secondly, I would note that there are three possible results in most football matches (win, loss, draw) rather than two, although there seems to be no way for Paul to predict anything other than a win or loss. So far none of the matches he has made predictions for have resulted in a draw, but the possibility exists nonetheless. How does that affect the overall dataset?"

This is a good point (and one which I ignored previously for the sake of keeping things simple), and an interesting one to discuss.

As we've discussed, if our (null) hypothesis remains that Paul is picking at random, his probability of picking either team is just 0.5. However, since it's not certain that one of those teams will go on to win the game, his chance of picking the winner is actually going to be less than that. For instance, if 2 in 3 games end in one of the two teams winning, Paul then has a 1 in 3 chance of picking the team that wins, a 1 in 3 chance of picking the team that loses, and a 1 in 3 chance of there being no winning team to be picked at all.
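Numerically, using the illustrative 2-in-3 rate of decisive games from above, the adjustment is a one-liner:

```python
# Under the null hypothesis Paul picks either team with probability 0.5,
# but he can only be right when the match produces a winner at all.
p_draw = 1 / 3                  # illustrative draw rate (1 game in 3 drawn)
p_correct = (1 - p_draw) * 0.5  # P(there is a winner) * P(blind pick is right)

print(p_correct)  # 1/3: equal chances of picking the winner, the loser,
                  # or there being no winner to pick at all
```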

What this amounts to is that Paul's chances of correct predictions are in fact even lower than those we'd already calculated, but unless you are willing to believe an octopus has been keeping an eye on the football pages of Bild, the chances are he's just very lucky.

Wednesday, 7 July 2010

I'd like to be, under the sea...

An octopus named Paul has been making the news due to his alleged ability to correctly predict the winner of Germany's international football matches. It started with Euro 2008, where he supposedly called 4 of Germany's 6 games correctly. The BBC reported this as "nearly 70%", which is perhaps being a little generous, as a 70% success rate sounds rather more impressive than correctly making 4 out of 6 50/50 guesses.

For the World Cup Paul has (apparently) correctly picked the results of Germany's 5 games up until tonight's semi-final, where he has controversially chosen Spain to triumph. So is Paul a Predicting Phenomenon, or just lucky?

We'll start with his World Cup picks where he's got 5 out of 5 right (so far). Our null hypothesis is that Paul is merely picking at random, and since each pick is a 50/50 choice this is the same as saying the probability he picks correctly is 0.5. The probability of getting 5 correct selections is then the same as tossing a coin 5 times and getting 5 heads. This is easy to compute, as we just multiply the probabilities together to get 0.5 x 0.5 x 0.5 x 0.5 x 0.5 = (0.5)^5 = 1/32 or about 3%. That seems pretty unlikely (although not too astronomical).

Amusingly, were we performing a statistical hypothesis test, we would in fact be likely to say that the data are not consistent with the null hypothesis that Paul is picking at random. This is because the probability that he would have got all 5 predictions correct is less than 5%, the standard cut-off used in hypothesis testing (we would say "the data are significant at the 5% level"). Of course, this highlights the danger of the common practice of just looking at a p-value (which is what our probability above is) and concluding that the null hypothesis must be true or false - it would take a rather stronger run of successes to convince most people that an octopus could really correctly predict football results. A 5% significance level means that even if our null hypothesis is true, an outcome will appear 'significant' (and we would question the null hypothesis) if the chances of it happening are less than 1 in 20. This really isn't that unlikely.

We do have more data, however, thanks to Paul's Euro 2008 picks. This takes his record to 9 correct out of 11 - is this statistically significant as well? Once again we want to calculate the probability that Paul would get this success rate picking at random, but it's slightly harder to work out this time. What we want to know is the probability that Paul would be at least this successful were he picking at random. So whereas before we just had to calculate the probability of 5 heads from 5 tosses, here we need to calculate the probability of 9 heads from 11 tosses, 10 heads from 11 tosses, and 11 heads from 11 tosses; adding these three probabilities up will tell us how 'lucky' Paul is.

So 11 heads from 11 tosses is easy: like the case with 5 out of 5, it's just 0.5 multiplied by itself 11 times. What about 10 heads, or 9? Things get a little trickier. Whilst there is only one way to get 11 heads from 11 tosses, there are several ways to get 9 heads. This might sound odd, but if you imagine tossing a coin twice, you can get either:

1) Tails followed by tails (TT)
2) Heads followed by heads (HH)
3) Tails followed by heads (TH)
4) Heads followed by tails (HT)

All of these outcomes are equally likely, but two of them (HT and TH) correspond to getting one head and one tail, and it's this which makes computing the probability of 9 heads from 11 tosses a bit tricky. Fortunately there's a simple formula for calculating this, known as the binomial coefficient. I'll spare you the details (since it's mathsy, and you can read Wikipedia if you like), and tell you how to use Google to get the number you want. Just type in "x choose y" and Google will tell you how many ways there are to get y heads from x coin tosses. Here, we want 11 choose 9, which gives us 55 ways to get 9 heads from 11 tosses. The probability of getting any one particular combination of 9 heads and 2 tails is just 0.5 multiplied by itself 11 times; once for each 50/50 coin toss. Since there are 55 different ways of doing this we then want 55 times this to allow for each possibility. So the final probability that one would get 9 heads from 11 tosses is 55 x (0.5)^11, about 2.7% or 1 in 37.

Similarly, we see there are 11 ways to get 10 heads from 11 coin tosses, so the probability of exactly 10 heads is 11 x (0.5)^11, about 1 in 186.

We can now put these three probabilities together and add them up to give Paul's prediction p-value as (1 + 11 + 55) x (0.5)^11 = 3.3% or about 1 in 30.
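The whole calculation fits in a couple of lines of Python, with `math.comb` playing the role of Google's "x choose y":

```python
from math import comb

# P(at least 9 heads in 11 fair coin tosses): count the ways to get
# exactly 9, 10 and 11 heads, each combination having probability (0.5)^11.
p_value = sum(comb(11, k) for k in (9, 10, 11)) * 0.5 ** 11

print(comb(11, 9))        # 55 ways to get exactly 9 heads from 11 tosses
print(round(p_value, 4))  # 0.0327, i.e. about 3.3% or 1 in 30
```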

So even taking Paul's two mistakes into account, his four extra correct picks mean the chances of him managing his record at random have only increased marginally, and his punditry powers remain statistically significant.

So how important is the match tonight? He's selected Germany's opponents Spain to progress, and of course if he's right it will be further evidence that his powers are not merely down to chance. But what if he's wrong? It would take his World Cup prediction record to 5 right out of 6 - would this still be statistically significant? The probability of getting at least 5 out of 6 right is calculated in the same way as our example above with 9 out of 11. The probability of 6 out of 6 is just (0.5)^6, and there are 6 ways of getting 5 heads from 6 tosses, so the probability of exactly 5 out of 6 is 6 x (0.5)^6. Adding these together we get a probability of about 11%, or 1 in 9. With just one wrong choice his picking would stop being statistically significant.

What about his lifetime record? That would go to 9 out of 12. I'll spare you the maths now and just tell you that the probability of getting at least 9 out of 12 right is 7.3%, or 1 in 14. Again, statisticians would stop heralding Paul as the mussel-eating messiah.
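Both 'what if he's wrong' scenarios can be checked with the same binomial tail calculation:

```python
from math import comb

def p_at_least(successes, trials):
    """P(at least `successes` heads in `trials` fair coin tosses)."""
    return sum(comb(trials, k)
               for k in range(successes, trials + 1)) * 0.5 ** trials

print(round(p_at_least(5, 6), 3))   # 0.109 - about 1 in 9 for 5+ of 6
print(round(p_at_least(9, 12), 3))  # 0.073 - about 1 in 14 for 9+ of 12
```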

So if he's wrong tonight he'll seem unremarkable (to p-value cultists, at least), whilst if he's right he'll be pushed further towards probabilistic stardom. This does of course demonstrate the dangers of trying to perform statistics entirely through p-values (which many practitioners do), and how susceptible they can be to even a single result one way or the other.

Now I'm off to watch the football where I'll be rooting for the Germans. If they win the World Cup it means England come joint second, right?

Tuesday, 1 June 2010

Eurovision Eurovision Eurovision

Eurovision has been and gone, and love it or hate it, it provides some nice data which I can use to demonstrate some statistics (hurrah). Let's get eurostatting:

One commonly held belief about Eurovision is that it's much better to perform early or late in the running order rather than somewhere in the middle. This is thanks to the serial position effect; we generally remember items in the middle of a list less well than those at the beginning (primacy effect) and end (recency effect). This year, however, a change was made to the Eurovision voting process - viewers could vote for their favourite song all the way through the competition, not just after hearing the final act. When I first heard this I was a bit nonplussed - how does letting people vote before they've heard all the songs make things fairer? I wonder if it will even have any effect...

Predictably, this led me to stay up till the early hours playing with data as I tried to answer two questions:

1) Is there evidence of a primacy/recency effect in Eurovision results?
2) Were there any appreciable changes to voting patterns this year, after the introduction of the new voting system?

To start with (as always) I go data mining. Thanks to the Internet, I can quite easily get hold of the results of as many Eurovision finals (and from 2004 onwards, semi-finals) as I'd like. I decided to take my dataset from 1998 onwards, as this is the first year where universal televoting was recommended, and so these data seem the most relevant to the present day.

So how do we go about investigating question 1? Whenever I start exploring data I always like to try and make some plots - the human eye is great at picking out patterns (admittedly sometimes where there aren't any to begin with...), and graphics are a great way to communicate data. So which data do I want to look at? I'm interested in identifying whether performing later or earlier means a country does better, and so for that I'm going to want the order in which they performed and the position they finished in. Is this good enough? Not quite. Because the number of countries entering the contest has fluctuated over the years (as well as differences between finals and semi-finals), from year to year the numbers are not yet comparable. For example, knowing a country finished 10th or performed 15th is a little meaningless if we don't know how many others it was competing against.

To make our numbers comparable we need to standardise them - fortunately a fairly easy procedure. For each of our individual contests we just divide a country's finishing position and performance order by the total number of countries competing in that particular competition. For example, a country finishing 25th out of 25 will be converted into a finishing 'score' of 25/25 = 1. Meanwhile, a country finishing first will have a lower finishing 'score' the more countries it was competing against (finishing 1st out of 10 would score 0.1, and is a better result than finishing 1st out of 5, which would score 0.2). The same logic is applied to performance order, so performing last always scores 1 and performing 1st scores less the more countries that are competing.
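The standardisation step is simple enough to sketch in a few lines (the function name is mine, not from the post):

```python
def standardise(position, n_countries):
    """Convert a raw position (1 = first) into a score in (0, 1],
    comparable across contests of different sizes."""
    return position / n_countries

print(standardise(25, 25))  # 1.0 - finishing last always scores 1
print(standardise(1, 10))   # 0.1 - 1st of 10 is a better (lower) score...
print(standardise(1, 5))    # 0.2 - ...than 1st of 5
```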

Now that we've standardised our data, we want to get back to plotting them, right? But what's the best sort of plot to use? All of our data are pairs of points - one finishing score and one performance order score, so can we just plot these as a scatter graph? Let's try that and see what happens:



Yikes. That's quite a mess. There are no particularly obvious patterns, so what do we do now? I think we need to manipulate our data a bit more to make it more accessible (and amenable to a different type of analysis).

We're going to simplify the data a little. Rather than looking at the specific finishing position and performance order for every country, we shall instead split them into quartiles. That is, we reduce our data to whether a country performed in the first, second, third or final quarter of contestants in a competition, and similarly whether they finished in the top, second, third or bottom quarter. Doing this, we can tabulate the simplified results:



As is hopefully discernible, each column corresponds to a performance order position - 1 means the first quarter, 4 the last quarter. Similarly, each row corresponds to a finishing position - 1 means finishing in the top quarter and 4 in the bottom quarter. We're interested in whether performance order affects finishing position, so we can make these data a little easier to interpret if we take column percentages - that is, for each column we calculate what proportion of countries that performed in that quarter then finished in the top, second, third and bottom quarter.



It's still a bit of a sea of numbers, but we can already see some interesting results - countries performing in the first quarter of a contest tend to do quite poorly, with 35.8% of such countries finishing in the bottom quarter, and 67.9% (just over two-thirds) finishing in the bottom half. Pretty much the opposite happens for countries performing in the final quarter; 34.4% go on to finish in the top quarter and 63.9% (just under two-thirds) finish in the top half. It seems our initial hypothesis was only half right - there's evidence here of a recency effect but not a primacy one. But could this just be down to chance?

Here we are interested in testing a hypothesis, specifically whether there is evidence of an association between performance order and finishing position. In statistical terms, this is our 'alternative hypothesis'. This is as opposed to a 'null hypothesis', which for us is that there is no association between performance order and finishing position. What a hypothesis test does is look at the data and ask whether or not it seems plausible they could have come about under the null hypothesis, in other words, is the pattern we think we see above merely due to chance?

The data are now in a rather nice format with which to perform Pearson's chi-square test. Put simply, this test takes our null hypothesis (that performance order has no impact on finishing position) and looks at how much the actual results deviate from what we would expect were this really the case. It's a powerful procedure, but also a fairly simple one, and whilst I shan't go into the mechanisms of it here, the wikipedia page explains it fairly well, and is hopefully penetrable to most with some A level maths in them.

From our tables above, it looks like our null hypothesis of no relationship between finishing position and order performance is false, but what does the statistical test say? The main output of the test I'm going to use here is a p-value, which is a commonly used means of testing a hypothesis. Discussion of p-values is really a post in itself, so I shan't go into too much detail here. What I will say, however, is that in most cases if a p-value is calculated as being less than 0.05 many will consider this reasonable evidence that the data being investigated are not consistent with the null hypothesis. In our case, a p-value of less than 0.05 would imply that there is evidence that our data do not seem to agree with the null hypothesis of no association between a country's performance order and finishing position.

Running the test, we get a p-value of 0.0001, which is much, much smaller than 0.05. Consequently most statisticians (myself included) would be happy to conclude that the data do not seem at all consistent with the null hypothesis; there is evidence of an association between performance order and finishing position.
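For anyone wanting to try this themselves, `scipy.stats.chi2_contingency` takes the table of counts directly. The counts below are illustrative placeholders to show the mechanics, not the actual Eurovision table:

```python
from scipy.stats import chi2_contingency

# Rows: finishing quarter (1-4); columns: performance-order quarter (1-4).
# These counts are ILLUSTRATIVE placeholders, not the real Eurovision data.
table = [
    [ 5, 10, 12, 20],
    [10, 12, 15, 13],
    [15, 14, 12,  9],
    [20, 14, 11,  8],
]

chi2, p, dof, expected = chi2_contingency(table)
print(dof)  # 9: (4 - 1) rows x (4 - 1) columns
print(p)    # a small p-value suggests the data don't fit 'no association'
```

With the real counts in place of the placeholders, this should reproduce the p-value quoted above.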

As for question 1 then, we've established that there does indeed seem to be a relationship between finishing position and performance order. I should stress, however, that we haven't actually shown what sort of relationship it is. Our statistical test just tells us that our observed data deviate from what we would expect sufficiently much to suggest they aren't just being scattered at random (there are things we could do to investigate the relationship further, but I think that's a tangent that will have to wait for another day). From the tables above though, it seems that countries who perform later do better, whilst those that perform earlier do worse - there is evidence of a recency effect, but not a primacy one. Who'd've thought after two hours of music you wouldn't remember the opening act?

But anyway, now that's dealt with we can finally move onto our second question - do we have any evidence that with the introduction of a new voting system anything has changed? To test this, we'll use the data from this year's contest - two semi-finals and a final, and take a similar approach. One complication emerges, however - Spain performed twice in the final after a stage invasion during their first performance - how can we take this into account? I've decided to just drop them altogether from the analysis, as there does not seem to be an obvious way to include them, and they are clearly a rather distinct case from all the other entries.

Having done this, we once again, split performance order and finishing positions into quarters, and report our results in a table:



Or, we can convert to column percentages again (that is, for each performance order quarter we can see what proportion of countries finished in each quarter overall). You'll have to forgive the odd rounding error...



To the eye, it's not quite as clear cut as it was with the older data, although the largest proportion of countries appear in the top right and bottom left cells as before. If we look a bit further though, there's less convincing evidence - recall that earlier over two-thirds of countries who performed in the first quarter went on to finish in the bottom half; here that proportion is just 57.1%. Furthermore, until this year 63.9% of countries who performed in the last quarter finished in the top half; this year that figure is 50%, just what you'd expect. Maybe things have changed...

Let's forget the guesswork though, we can just do another Pearson's chi-square test, right? Well, unfortunately we can't. Pearson's chi-square test requires us to have sufficiently many observations to make some of its underlying assumptions valid, and we just don't have enough data. Fortunately there is another test - Fisher's exact test - which we can use when our sample size is this small. Like Pearson's test, it's fairly easy to compute (although again I'll spare the details), and running it we get a p-value of 0.6381. This is rather large, and suggests that our data are consistent with the null hypothesis - in other words, it seems that performance order doesn't have an effect on finishing position under the new system.
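As a sketch in Python: note that `scipy.stats.fisher_exact` only handles 2x2 tables (a full r x c Fisher test would need something like R's `fisher.test`), so for illustration the data are collapsed to first-quarter vs last-quarter performers against top-half vs bottom-half finishes, with placeholder counts:

```python
from scipy.stats import fisher_exact

# ILLUSTRATIVE placeholder counts, not the real 2010 data.
# Rows: finished top half, finished bottom half.
# Columns: performed in first quarter, performed in last quarter.
table = [
    [3, 5],
    [4, 4],
]

oddsratio, p = fisher_exact(table)
print(p)  # a large p-value is consistent with 'no association'
```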

I would, however, not set too much store by this conclusion. As mentioned, this is based on just three 'contests' - two semi-finals and a final - and so our test is not particularly powerful. When we have fewer data it is much harder to convince ourselves that we have found evidence of some sort of relationship - there is too much that can change due to chance. For example, if you toss a coin 100 times and get 30 heads and 70 tails you'd be fairly suspicious about it being biased. If you tossed it 10 times and got 3 heads and 7 tails however, you'd probably just think this was reasonable for a fair coin, and think this disproportionate result was just down to chance.
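The coin intuition is easy to check with the same sort of binomial tail calculation as in the Paul posts:

```python
from math import comb

def p_at_most(heads, tosses):
    """P(at most `heads` heads in `tosses` fair coin tosses)."""
    return sum(comb(tosses, k) for k in range(heads + 1)) * 0.5 ** tosses

print(p_at_most(30, 100))  # tiny (well under 0.1%) - a suspicious coin
print(p_at_most(3, 10))    # about 0.17 - perfectly plausible for a fair coin
```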

Still, it's a promising start, and it will be interesting (assuming this new voting system is maintained) to see how future years' data stack up when combined with what we have. Maybe it's not so silly to let people vote as they go along after all...

Thursday, 13 May 2010

Doing it by Degrees

When it comes to looking at university education, many do not have to think too hard about what they want to study, the bigger dilemma is where to study it. That said, university prospectuses will often tout various statistics to try and lure potential students into plumping for a particular course, with employability one of the more commonly seen figures. But is this a reliable metric of how 'valuable' a degree is? Or is it yet another example of STATISTICS ABUSE? (roll opening credits)

Every year AGCAS produces a report looking at destinations of graduates six months after graduation. The latest report, from 2009, is the result of questionnaires sent to all graduates from the 2007/8 academic year, and can be downloaded here. The report itself contains quite a lot of interesting data, with destination breakdown (how many graduates are employed, unemployed, or studying for further degrees) as well as stats on the types of job those in employment have found themselves in. With these statistics available for a number of subjects (or subject areas), we can get a feel for which subjects seem the most or least valuable.

The report provides details on the following subjects/subject areas:

Science

Biology; Chemistry; Environmental, Physical Geographical and Terrestrial Sciences; Physics; Sports Science

Mathematics, IT and Computing

Computer Science and Information Technology; Mathematics

Engineering and Building Management

Architecture and Building; Civil Engineering; Electric and Electronic Engineering; Mechanical Engineering

Social Sciences

Economics; Geography; Law; Politics; Psychology; Sociology

Arts, Creative Arts and Humanities

Art and Design; English; History; Media Studies; Languages; Performing Arts

Business and Administrative Studies

Accountancy; Business and Management; Marketing

So let's start with employment, surely a perfectly good benchmark of how 'good' a degree is. The AGCAS report splits graduates into those in UK employment, overseas employment, as well as those working and studying. We add these three together to give us our employment figures:

Top 5 for Employment:

Civil Engineering (78.3% employed)
Marketing (74.6%)
Business and Management (73.6%)
Architecture and Building (73.4%)
Accountancy (73.0%)

Bottom 5 for Employment:

Law (35.2%)
Physics (37.9%)
Chemistry (44.0%)
Biology (58.0%)
History (58.7%)

I think it's fair to say there are some surprises here. Marketing, and Business and Management - two subject areas often cited as housing archetypal 'Mickey Mouse' degrees - make the top 5, whilst historically 'tough' subjects like chemistry and physics are at the opposite end. Are people really better off studying business over biology? Or is there something wrong with our metric?

Naturally, I'm inclined to believe the latter, and with good reason. As is so often the case, one statistic does not tell the whole story; whilst these numbers tell us what proportion of graduates were employed six months after graduation, it is not simply the case that everyone else was unemployed. AGCAS reports a number of 'studying' statistics as well, such as those studying for a higher degree, a PGCE, or professional qualifications. Perhaps then, unemployment is a better way of assessing degrees, as this takes people who are 'employed' with study into account. Let's see what happens:

Top 5 for Unemployment:

Law (5.5% unemployed)
Sports Science (5.6%)
Geography (6.4%)
Civil Engineering (7.0%)
Psychology (7.4%)

Bottom 5 for Unemployment

Computer Science and Information Technology (13.7%)
Media Studies (12.3%)
Art and Design (12.2%)
Electrical and Electronic Engineering (11%)
Accountancy (10.9%)

Quite a big change. Law jumps from worst for employment to best for unemployment (as you might expect, they're all studying), and accountancy has done the opposite. There are still some surprises, such as Computer Science and IT having the highest rate of unemployment, and another 'Mickey Mouse' course in the form of Sports Science being second best. However, this seems a much less debatable statistic than employment, and so it seems reasonable to take these figures at face value.

There is, of course, an issue we have yet to discuss, which will be a rather pressing one for many new graduates: money. What good is being employed if you're only getting paid £5 an hour for those fancy letters after your name?

The salary data in the AGCAS report are a little harder to find, let alone digest. Whilst we get nice pie charts and percentage breakdowns for destinations, discussion of salaries is restricted to an introductory paragraph. If we trawl through these, however, we do get some numbers, and merging them all together we can do another top and bottom 5, this time based on the average salary of respondents.

Top 5 for Salary

Economics (£24065)
Civil Engineering (£24006)
Architecture and Building (£23689)
Mechanical Engineering (£23683)
Electrical and Electronic Engineering (£22372)

Bottom 5 for Salary

Art and Design (£15656)
Media Studies (£16295)
Psychology (£16500)
Sports Science (£16627)
English (£16642)

Once again, a rather marked change. Media Studies keeps the bottom 5 place it enjoyed under the unemployment stats, but it is joined by Sports Science, which was second best for unemployment. There are no real surprises in our top 5, however, all these subjects having a fairly substantial pedigree.

For the sake of argument, then, let's suppose that you are most interested in average salary. As I mentioned, the AGCAS report makes it much easier to find the employment/unemployment figures for a subject than it does to find average salaries. Do these provide an adequate indicator of average salary? Our top/bottom 5s above would suggest not, but these only cover 10 of 26 subjects. Let's plot some graphs!

First up, average salary against employment, is there a strong link between the two?



Hmm, no obvious pattern there, then. How about unemployment, does that give us a better fit?



There doesn't seem to be any sort of pattern there either.

We can in fact calculate a number that gives us an idea of how closely related two sets of numbers are. The correlation coefficient between two sets of (x,y) points (like our (employment %, salary) points on our graph) varies from -1 to 1. If it's close to 0 that means our numbers are not closely related, whilst if it is close to +1 or -1 it suggests a strong relationship. For example, if in our plot of employment against salary above all our points seemed to be on a straight line, this would suggest a correlation of around 1 or -1. The sign indicates the direction of the correlation. If it's positive this means as salary increases, so does employment. If it's negative, then as salary increases, employment decreases. This doesn't mean one causes the other - "correlation does not imply causation" is one of a statistician's many mantras - it just shows that these data happen to have an association (which we may go on to convince ourselves is a causal one).
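In code, the correlation coefficient is a single call to `numpy.corrcoef` (the numbers here are a tiny made-up example, just to show the mechanics):

```python
import numpy as np

# Tiny made-up (x, y) pairs - not the AGCAS figures.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # 0.8 - a fairly strong positive association
```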

So that diversion aside, what correlations do we get in our two plots above? Looking at them, we'd expect it to be close to zero; there doesn't seem to be much of a pattern in either of them. For the first plot, of employment against salary, we find a correlation coefficient of 0.12 - so not much of a surprise there. For unemployment it's even worse: 0.06. In short, neither employment nor unemployment is a good indicator of average salary.

There is one area of the AGCAS report we haven't discussed, however, which might prove useful. Whilst each subject has a page of percentages of those in employment, studying, and so on, it also has a page showing what types of jobs are held by those who are employed. These range from a variety of 'Professionals' down to 'Numerical Clerks and Cashiers', and 'Retail, Catering, Waiting and Bar Staff'. This last one doesn't sound too glamorous; you've just spent 3 years earning a degree and you're still working in a bar? More to the point, these jobs are going to be low paying, so hopefully they're a better indicator of average salary. Let's see:



There definitely seems to be a pattern there, and the correlation between the two variables is -0.88 - that's a pretty strong negative correlation. The higher the proportion of those employed in retail, the lower the average salary. Not a surprising result, but it's always worth checking these things.

Is this at all useful, though? The salary data are in the document; you just have to dig for them a bit more. There is, however, one thing we've not mentioned. Because the report doesn't give average salaries the same prominent treatment as the employment data, some numbers are, in fact, missing. Whilst we can see what proportion of history graduates are studying in the UK for a teaching qualification, we can't find their average salary six months after graduation (and the same goes for performing arts). However, because we've identified the percentage of those working in retail as a useful indicator of average salary, we can use this knowledge to predict the average salaries of history and performing arts graduates. (In statistics, we'd call our retail statistic a 'proxy' for salary.)

So how do we turn our retail employment data into a prediction of salary? If you read my previous post about the times goals are scored in football matches, you should already know where I'm going with this. If not, then go and read it now, and come back when you're ready to apologise for such an oversight.

So anyway, it's time for some more linear regression. We're looking to fit the model S = a + bR, where S is salary, and R is the percentage of those employed who are employed in retail. If we can estimate a and b, then we can use this equation to estimate S when we only know R, as is the case for history and performing arts degrees. We can also plot a cool line on our graph to show the trend. Running the numbers, we find a = 25014 and b = -468, and plotting the line this generates onto our graph gives us:



We can now either use our equation S = a + bR with a and b replaced with 25014 and -468, or read straight off the line on our graph. For both history and performing arts, retail employment was 17.4%, so plugging R = 17.4 into this equation gives S = 25014 - 468*17.4 = £16,870.80. Our model suggests that both subjects seem to lead to (relatively) low average salaries, something which would not have been easy to discern from the report alone.
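For the curious, the whole fit-and-predict step is only a few lines of code. Here's a sketch in Python; since the underlying AGCAS data points aren't reproduced in this post, the snippet defines the least-squares fit generically and then reuses the coefficients quoted above:

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of a and b in y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Reusing the coefficients quoted in the text to predict salary (S) from
# retail employment (R) for history and performing arts (R = 17.4):
a, b = 25014, -468
print(round(a + b * 17.4, 2))  # 16870.8
```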

Alas, this all assumes our model is accurate, and with a relatively small number of observations I wouldn't be inclined to place too much confidence in these conclusions. Here I've taken a single report to base rather a lot of analysis on. However, it does illustrate a couple of interesting points. Firstly, mere 'employability' figures seem a rather dubious metric on which to base the value of a degree. Perhaps more surprisingly, unemployment doesn't seem to be a particularly good one either, at least in terms of indicating average salary. Whilst this report did have salary data in it, they weren't as clearly laid out as the other data, and were in fact missing for some subjects. This has allowed us to demonstrate how you can use another variable (if you think it's a good enough surrogate) to estimate missing data. Whilst for this particular problem you're probably better off just trying to hunt down the data you want in another report, our way is clearly much more fun.

Wednesday, 5 May 2010

How many horses?

So the Lib Dems sent me some election material this morning. Unfortunately for them, ours is a very safe Labour seat, as you can see from this bar chart of the last election:

Not particularly pretty, but fairly clear, I think. Labour have a big majority, the Lib Dems are a (relatively) distant second, and the Tories and Greens are pretty much just making up the numbers.

So, how did the Lib Dems choose to present these data in their election leaflet? Like this!

Crikey. They say it's a two-horse race, and it really does look like one, doesn't it? Except hang on - this graph should be showing the same data as mine, so why does it look so different? Surely they haven't been abusing statistics for political gain?

Well, before we accuse them of that, let's check a couple of common tricks people use when presenting bar charts to try and give a particular impression.

First up, it's the 'cut the y-axis above zero' method. Here that means rather than having the bottom of the graph equivalent to zero votes, having it equivalent to something larger. The Lib Dems can't have done this though, because that would only exaggerate the difference in votes. To demonstrate, if we dismiss the Tories and just plot the Lib Dem and Labour votes, and have a cut-off at 9,000 votes, it looks like this:

Wow, no point voting for anyone other than Labour here, they've got it wrapped up... (Obviously, were we making real propaganda, we'd leave off the y-axis; you can't have people reading that and working out what we're up to!)

So the Lib Dems can't have done that; another option is a logarithmic y-axis. What this means is that rather than each mark on the y-axis indicating a constant increase in votes, each mark instead corresponds to an increase by some factor, say 10. In other words, whilst a standard axis will go 1,000, 2,000, 3,000, and so on, a logarithmic one would go 1,000, 10,000, 100,000, increasing by a factor of 10 each time. These scales are useful when you're trying to show a graph with both very large and very small numbers. It would seem a bit silly to use one here, but can it explain the Lib Dem graph?

Encouraging? Maybe. Notice how everyone now seems much closer, and that the y-axis is increasing logarithmically, going up in multiples of 10. This still doesn't really look like the graph the Lib Dems produced (the Tories seem a lot closer than they should be), so let's tweak it a bit more, and go back to cutting the y-axis off somewhere suitable. We'll also drop the pesky marks on the y-axis that actually tell us what's going on:

Aha! That's much more like it. Not a perfect imitation, but certainly getting there. We've got the Tories down as an also-ran, and the Lib Dems really giving Labour a run for their money. We could probably pick a better logarithmic factor (we used 10 here) to get the Lib Dem and Labour bars a bit closer together, but I think by now we've established that the Lib Dems are really just playing Silly Buggers. I can't imagine they actually fished around for a scale that makes the graph look like that; more likely they've just drawn some appropriately shaped bars and stuck the numbers on. Of course, they've told us the numbers (and even given a source for bonus authenticity!), so it's our own fault if we just look at the coloured rectangles and draw the wrong conclusion. Still, that's precisely what they're hoping people will do, and it's a great example of why people don't trust statistics.
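To see just how flattering a logarithmic axis can be, compare bar heights under the two scales. The vote counts below are purely illustrative (the real figures aren't reproduced in this post):

```python
import math

# Illustrative vote counts, not the actual constituency figures.
votes = {"Labour": 18000, "Lib Dems": 9000, "Tories": 3000}

# Linear axis: bar height is proportional to votes, so Labour's bar is
# six times the Tories' bar.
linear_ratio = votes["Labour"] / votes["Tories"]

# Logarithmic axis (base 10, drawn from 1): height is proportional to
# log10(votes), and the same six-fold gap all but vanishes.
log_ratio = math.log10(votes["Labour"]) / math.log10(votes["Tories"])

print(linear_ratio)         # 6.0
print(round(log_ratio, 2))  # 1.22
```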

Tuesday, 4 May 2010

Practical Probability - Is insurance a 'tax on the stupid'?

In a previous post I talked about gambling, and specifically the value of lottery tickets. I opened with the line "lotteries are a tax on the stupid", which I have often heard people trot out when they feel it pertinent. When someone says this in my earshot, I have a simple question in reply: "Do you have home insurance?". Almost invariably, the answer is "yes...why?".

Suppose I've set up a lottery; let's call it Thundercracker. I quite like money, but I'm also a bit lazy, so my lottery isn't very complicated. Each week you pay me £1 and get a lottery ticket on which you pick a number from 1 to 10. I'll then hold a draw where I pick a numbered ball out of a bag: if your number comes out I'll give you £5; if not, you win nothing. We can work out your 'expected' returns in the same way we did when talking about coin tosses. You have a one in ten chance of winning and profiting £4, and a nine in ten chance of losing your £1 (or, to put it another way, profiting -£1). To return to the vernacular from the previous post:

You win with probability 0.1 and profit £4
You lose with probability 0.9 and profit -£1

and so your expected profit is 0.1*£4 + 0.9*-£1 = £0.40 - £0.90 = -£0.50. On average you lose (and so I profit) 50p every week. Sounds good to me, and aren't you so stupid to keep playing when the odds are stacked against you?
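That expected-profit arithmetic generalises to any list of (probability, profit) pairs; here's a quick Python sketch of the Thundercracker sum above:

```python
def expected_profit(outcomes):
    """Expected profit given (probability, profit) pairs whose probabilities sum to 1."""
    return sum(p * profit for p, profit in outcomes)

# Thundercracker: 1-in-10 chance of a £4 profit, 9-in-10 chance of a £1 loss.
thundercracker = [(0.1, 4.0), (0.9, -1.0)]
print(round(expected_profit(thundercracker), 2))  # -0.5, i.e. a 50p loss per week
```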

One week however, I get bored of the balls in a bag lark, and I decide to change the rules slightly. I happen to know you're a bit of a minimalist, and that the value of everything in your home is £5. Now, rather than giving you £5 if I pick your ball out of the bag, I'll give you £5 if instead everything in your house gets stolen. From your perspective nothing has changed (fiscally at least): if you 'lose' the lottery (that is, your stuff doesn't get stolen), you're down the £1 you paid to me for your lottery 'ticket'. On the other hand, if you 'win' the lottery (by having all your stuff nicked) then you win £5 from me. Because the lottery has nothing to do with whether your stuff got stolen or not, you would have been in that predicament anyway, so the £5 I give you is just like the £5 you get if you win the old lottery. In fact, I've decided the probability that you'll get burgled in any one week is 1 in 10, so I continue to make the same profit I did before, and you the same (expected) loss.

This is a bit of a silly example, but it illustrates the principle: paying a weekly premium for insurance is, in terms of financial loss or gain, doing exactly the same thing as playing the lottery. The only difference is that in a lottery the probabilities are all easy(ish) to calculate, whereas things are a lot less clear for insurance.
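The equivalence is easy to check numerically. Using the made-up figures from the example above (£1 premium, £5 of contents, a 1-in-10 weekly chance of burglary), the policy and the lottery ticket have identical expected returns:

```python
# Expected weekly return of the insurance policy itself: you always pay the
# £1 premium, and with probability 0.1 you receive the £5 payout (which
# exactly replaces the stolen goods, so only the premium and payout matter).
p_burgled = 0.1
premium, payout = 1.0, 5.0
insurance_return = p_burgled * payout - premium

# Expected weekly return of the Thundercracker ticket: 1-in-10 chance of a
# £4 profit, 9-in-10 chance of a £1 loss.
lottery_return = 0.1 * 4.0 + 0.9 * (-1.0)

print(round(insurance_return, 2), round(lottery_return, 2))  # -0.5 -0.5
```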

However, one thing you do know about insurance companies is that, like casinos, they always win (otherwise they would go out of business). So overall they are going to be offering worse returns than they should given the true chances of bad things happening. You might find a policy which you individually are expected to profit from, but you would be very fortunate to do so.

Of course, losing your house is perhaps as bad as winning millions of pounds is good. Indeed, when talking about lottery tickets I discussed how the 'value' of an outcome isn't necessarily simply the number of pounds you get from it. The same logic can be applied here. Fiscally speaking, insurance sets you up for a loss in the same way a lottery ticket does. However, many would argue the value they ascribe to the various possible outcomes means that insurance (to them, at least) is worth it overall. Others may feel the same about playing the lottery. Is either really a 'tax on the stupid'? It depends on where your values lie.