One of my colleagues sent me an article in the Financial Times from March 17 entitled “How to save a penalty: the truth about football’s toughest shot. On star goalie Diego Alves, game theory and the science of the spot kick.” I found the article interesting for two reasons.

- It has a fun discussion of the psychology and game theory of taking penalty kicks. It points to the paper by Ignacio Palacios-Huerta in which he shows that professional soccer players take penalties in a way that is consistent with Nash equilibrium (or minimax) behavior. The FT article also includes an interesting interview with Ignacio Palacios-Huerta and his “analysis of ideal penalty-taking strategies for the then Chelsea manager Avram Grant before the Champions League final against Manchester United in 2008.”
- The FT article highlights Diego Alves, Valencia’s goalkeeper, and argues that he is particularly good at stopping penalties. The FT article argues that Diego Alves’ stopping record (he stopped 22 of 46 penalties – a very high number compared to the average stopping rate of 25% of all goalkeepers combined) cannot be explained by chance alone.

In this blog post I want to comment on the second point. It is actually wrong. And it is wrong for an interesting reason. Moreover, the mistake made is very easy to make and is a very common one.

So how does the analysis in the FT work? We are interested in testing the null hypothesis that Diego Alves’ true stopping probability is (at most) 25%. If this null hypothesis is true the probability of observing Diego Alves stopping 22 (or more) shots out of 46 is given by the binomial formula and can be calculated to be 0.0676%. Statistically minded readers will know this as the p-value (associated with the given one-sided null hypothesis).
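The binomial calculation above can be reproduced in a few lines. This is only a sketch using the numbers quoted in the post (22 saves, 46 penalties, a 25% stopping rate under the null); the function name `binomial_tail` is my own and not something from the FT article.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Numbers from the post: 22 or more saves out of 46 penalties,
# under the null hypothesis of a 25% stopping rate.
p_value = binomial_tail(46, 22, 0.25)
print(f"one-sided p-value: {p_value:.4%}")
```

The printed value should agree with the 0.0676% quoted above.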

As this probability is very small (much smaller than the commonly used 5% cut-off), the FT article then claims that the null hypothesis must be wrong and that Diego Alves really has a true stopping rate that is higher than 25%.

But this is not necessarily so. So what is wrong? In this analysis we forget that it was no accident that we chose to look at Diego Alves. Why did the FT look at Alves? I guess the only reason for this choice is that he has made an unusually high proportion of stops. So if Diego Alves had not made such a high number of stops, but someone else had, we would have looked at this someone else. In other words, the FT article is not about Diego Alves, it is about the goalkeeper with the highest proportion of stops. This person just so happens to be Diego Alves.

So what does this mean? It means that we have to take into account that we are looking at the highest empirical stopping rate of about 400 goalkeepers. The FT has a nice graph looking at penalties faced (on the x-axis) and penalties stopped (on the y-axis) for many goalkeepers from the top leagues. I roughly counted (estimated) that this graph has about 400 dots (i.e. 400 goalkeepers).

Now I am going to make a mistake myself. I am going to make the empirically clearly wrong assumption that all of these 400 goalkeepers have faced exactly 46 shots. I do this so that I can make my point fairly simply and quickly. If I had the full data I could do this correctly. But I think it is sufficient for the point I want to make.

Suppose therefore that we have 400 goalkeepers who each have faced 46 shots. What is then the likelihood that the best of them has stopped 22 (or more) of these 46 shots if the true stopping rate is 25%? This can be calculated as 1 minus the probability that all of them make fewer than 22 stops. Assuming the goalkeepers’ results are independent of each other, this is given by 1-(1-0.000676)^400. This probability is 23.7131%.
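Here is a short sketch of that "best of 400" calculation. It relies on the same simplifications as the text: 400 goalkeepers (my rough count from the FT graph), each facing exactly 46 penalties, all with a true 25% stopping rate, and with results independent across goalkeepers. The constant names are mine.

```python
from math import comb

N_KEEPERS = 400  # rough count of dots in the FT graph (an estimate)
SHOTS = 46       # simplifying assumption: every keeper faced 46 penalties
SAVES = 22       # threshold: 22 or more saves
RATE = 0.25      # null hypothesis: common 25% stopping rate

# Probability that one given keeper stops 22 or more out of 46.
p_single = sum(
    comb(SHOTS, i) * RATE**i * (1 - RATE) ** (SHOTS - i)
    for i in range(SAVES, SHOTS + 1)
)

# Probability that at least one of the 400 keepers does so,
# assuming the keepers' results are independent.
p_best = 1 - (1 - p_single) ** N_KEEPERS

print(f"single keeper:  {p_single:.4%}")
print(f"best of {N_KEEPERS}:   {p_best:.2%}")
```

Using the unrounded single-keeper probability rather than 0.000676 is what gives a figure of roughly 23.71%.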

To summarize, if the true stopping rate of all goalkeepers is 25% and if all 400 goalkeepers face 46 penalty shots, then the probability that the best of them stops 22 or more of these 46 penalties is 23.71%. That is not sufficiently improbable (it is for instance higher than the usual 5% cut-off) to make me abandon the null hypothesis that all goalkeepers (including Diego Alves) have the same 25% stopping rate. In other words, I do not believe that there is anything particularly special about Diego Alves’ penalty kick stopping ability. We could do another test after he has faced another 46 penalties. I doubt he will save another 22 of these.

I think this nicely illustrates the narrowness of frequentist reasoning. Your argument is correct, as far as I can see, but it doesn’t tell me what football fans are really interested in, namely what I should believe about Diego Alves’ penalty stopping probability, given that he has stopped 22 out of 46. You only make statements about how likely it is to see someone, anyone, stop 22 out of 46 penalties given the null hypothesis. This is not what the FT readers really want to know. They want to know what the observed data suggest about Diego Alves’ skills.

Worse, a careless reader might interpret you as saying: “Diego Alves is probably not an above-average keeper.” However, your argument shows no such thing. Such a statement doesn’t even make sense in a frequentist worldview. Either Diego Alves is above-average or not, and your calculations reveal nothing about his true type.

Suppose I have a prior belief, before observing anything, that Alves is just a normal keeper with a 25% stopping rate. If I ignored my prior belief, or equivalently, if I had a completely flat prior, I would conclude that the best guess for Diego Alves’ true stopping probability is 22/46 = 47.8%. That’s the maximum likelihood estimate. If I take into account my prior belief, I should take a weighted average between 25% and 47.8%. How much weight should I give the prior?

Suppose I started out with a flat prior, then observed 400 keepers, whose average penalty stopping rate was 25%. This will make me give a lot of weight to the prior, but less than 100%. It is reasonable to conclude, based on the evidence, that Diego Alves’ stopping rate is more likely to be above 25% than below. The best guess about his probability of stopping one more penalty (the posterior predictive probability) is higher than 25%. In other words, the fact that he stopped 22 out of 46 penalties makes it more likely than not that Diego Alves is an above-average keeper.
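One way to make the "weighted average" concrete is a Beta prior on the stopping rate, which is conjugate to the binomial. The Beta form and the `prior_strength` values below are illustrative assumptions of mine, not something derived from the FT data, and the sketch does not model the fact that Alves was picked because he had the best record.

```python
def posterior_mean(prior_mean: float, prior_strength: float,
                   saves: int, shots: int) -> float:
    """Posterior mean of the stopping rate under a Beta prior.

    The prior is Beta(a, b) with a = prior_mean * prior_strength and
    b = (1 - prior_mean) * prior_strength, so prior_strength acts like a
    number of "pseudo-penalties" already observed (a hypothetical knob).
    """
    a = prior_mean * prior_strength
    b = (1 - prior_mean) * prior_strength
    return (a + saves) / (a + b + shots)


# 22 saves out of 46, with a prior centred on the 25% average rate.
for strength in (10, 100, 1000):  # illustrative prior strengths only
    weight_on_prior = strength / (strength + 46)
    estimate = posterior_mean(0.25, strength, 22, 46)
    print(f"strength={strength:5d}  weight on prior={weight_on_prior:.2f}  "
          f"posterior mean={estimate:.3f}")
```

For any finite prior strength the posterior mean lands strictly between 25% and 47.8%, which is the sense in which the best guess is above 25%; how far above depends entirely on how strong the prior is taken to be.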

It is indeed an interesting question as to what one “should” believe about Diego Alves’s skill as a goalkeeper after seeing the evidence. Note however that this is not a problem of the frequentist versus the Bayesian approach. It is simply the case that the data is consistent with many prior beliefs (if you want to be Bayesian) or null hypotheses (if you want to be classical or frequentist).

If you initially believe that all goalkeepers are equally good, then the statistic given by the FT (that Diego Alves saves more than anyone else) does not in itself contradict this belief. Of course, if you initially believe that there is heterogeneity in goalkeepers’ skills, then the data is also consistent with such a belief (although one would actually then want to look more carefully at the full data set), and the updated belief would be that Diego Alves is probably an (at least slightly) better goalkeeper than many others. I wouldn’t necessarily attach a very high likelihood to him being the best goalkeeper, though, even in this case.

In any case, the main point of interest, I thought, was that, regardless of whether you approach this problem as a Bayesian or a classical statistician, you should not ignore the data selection. One should think about why we are looking at Diego Alves. If it is the case that you already suspected 5 years ago that Diego Alves is a very good keeper, and you look at data only after that time, then the data allows you to reject the hypothesis that he is an average goalkeeper. If you only now decide to look at Diego Alves because he has had such a great penalty saving rate, then the data does not allow you to reject the hypothesis that he is an average goalkeeper.
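The difference between the two situations can be illustrated with a small simulation. This is a rough sketch under the same simplifying assumptions as in the post (400 identical goalkeepers with a true 25% stopping rate, 46 penalties each, independent results); the function name, the number of trials and the seed are arbitrary choices of mine.

```python
import random
from math import comb


def simulate(trials: int = 2000, keepers: int = 400, shots: int = 46,
             rate: float = 0.25, threshold: int = 22, seed: int = 0):
    """Estimate how often an average keeper appears to be exceptional.

    Returns two frequencies over the simulated seasons:
    (1) a keeper chosen in advance reaches the threshold,
    (2) the keeper with the best record reaches the threshold.
    """
    rng = random.Random(seed)
    outcomes = range(shots + 1)
    # Binomial probabilities for 0..shots saves under the null hypothesis.
    pmf = [comb(shots, k) * rate**k * (1 - rate) ** (shots - k)
           for k in outcomes]
    prespecified_hits = 0
    best_hits = 0
    for _ in range(trials):
        saves = rng.choices(outcomes, weights=pmf, k=keepers)
        if saves[0] >= threshold:    # keeper picked before seeing any data
            prespecified_hits += 1
        if max(saves) >= threshold:  # keeper picked because he was best
            best_hits += 1
    return prespecified_hits / trials, best_hits / trials


prespecified, best = simulate()
print(f"keeper chosen in advance: {prespecified:.2%}")
print(f"keeper chosen afterwards: {best:.2%}")
```

The first frequency should be close to the small single-keeper p-value, while the second should be in the neighbourhood of the roughly 24% computed in the post, up to simulation noise.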

I did not interpret your post as an evaluation of Alves, but rather as a discussion of “what is such data able to tell and what not”. Given that, I think many students would appreciate discussions like this in courses on statistics. It not only helps to understand the method, but also why it is important to understand it in the first place.