If you’re going to correct me, correct me correctly

This is a very technical rant about statistics, specifically the way that people talk about statistics. To be more precise, it’s about people correcting each other about the way that they use statistics. If you’re not interested in statistics, then just imagine that I said “When your correction contains errors, it diminishes the impact and value of the good points you are making,” and you can walk away without feeling that you’ve missed out. (That statement is wrong in some interesting ways, but that’s a different topic.)

The paper The Fallacy of Placing Confidence in Confidence Intervals does two things at once: It very correctly points out that confidence intervals are an easily-abused statistical tool. It very incorrectly misuses terminology to make oxymoronic statements regarding the interpretation of statistical data.

On the plus side, there are some really good illustrations of differences between statistical methods in the Submarine Discussion. Different tools yield different results — to the extent that your analysis can be informed by facts beyond the sampled data (in this case, knowing the length of the submarine and the position of the hatch), you can do a much better analysis. (Don’t think about the submarine case study too much — if you’re in the business of rescuing submarines, use experienced submariners, not statisticians.)

They also give some really good advice, like “Never report a confidence interval without noting the procedure and the corresponding statistics.” But in the same paragraph, they’ll then say “In the submersible example, consider a 50% confidence interval .5 meters wide. This could correspond to very precise information (Bayesian interval) or very imprecise information (UMP and nonparametric interval).” Forgive me for talking like an engineer for a moment, but if the confidence interval is 0.5 meters wide, then the precision of the confidence interval is 0.5 meters.

There is no negotiation or interpretation to take place here. As reported in McConnell’s excellent Software Estimation, people routinely confuse ‘accuracy’ with ‘precision’, and that seems to be what happened. Precision is ‘the width of the estimate’, while accuracy is ‘the chance that the estimate is correct’. If I estimate your core body temperature as ‘between 25 and 50 degrees Celsius’, then either I’m right, or you’re dead. If I estimate your core body temperature as 37.253 ± 0.004 degrees Celsius, then that’s extremely precise. Given how little I know about you, it’s unlikely to be accurate. In this particular example, neither estimate is particularly useful.
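If it helps to see that distinction in code, here’s a minimal sketch of the body-temperature example. The ‘true’ temperature of 36.8 degrees is an assumption I’ve made up purely so the example runs:

```python
# A minimal sketch of the precision/accuracy distinction, using the
# body-temperature example above. All numbers are illustrative.

true_temp = 36.8  # hypothetical true core body temperature, degrees Celsius

wide = (25.0, 50.0)                        # low precision, almost certainly accurate
narrow = (37.253 - 0.004, 37.253 + 0.004)  # high precision, probably not accurate

def precision(interval):
    """Precision here is simply the width of the interval."""
    low, high = interval
    return high - low

def is_accurate(interval, truth):
    """Accuracy, in this sense, is whether the interval contains the truth."""
    low, high = interval
    return low <= truth <= high

for name, interval in [("wide", wide), ("narrow", narrow)]:
    print(f"{name}: width = {precision(interval):.3f}, "
          f"contains the true value: {is_accurate(interval, true_temp)}")
```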

The clue to understanding nonparametric methods is in the name. If you don’t know anything at all besides the data values you’ve been given (if you don’t know what the parameters are), then nonparametric methods are a great way of turning complete ignorance into an understanding of what’s plausible. They don’t give you ‘the truth’, but they do give you an understanding of the data you’ve got that protects you from looking like an idiot.
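To make that concrete, here’s a sketch of one common nonparametric tool, the bootstrap percentile interval, run on data values I’ve invented purely for illustration:

```python
# A sketch of a nonparametric tool: a bootstrap percentile interval.
# It uses nothing but the data values themselves -- no assumed parameters.
import numpy as np

rng = np.random.default_rng(0)
data = np.array([0.12, 0.31, 0.27, 0.44, 0.19, 0.38, 0.52, 0.23, 0.35, 0.29])

# Resample the data with replacement many times and record the mean each time.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

# The middle 90% of those resampled means is a 90% bootstrap interval.
low, high = np.percentile(boot_means, [5, 95])
print(f"90% bootstrap interval for the mean: [{low:.3f}, {high:.3f}]")
```

No knowledge of the population went into that interval; it was built from nothing but resamples of the data itself.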

Speaking of which, let’s go to a more recent essay that they’ve posted, and examine this statement:

How can we interpret the confidence interval? Confidence interval advocates would have us believe that the interval [0.108, 0.523] gives “plausible” or “likely” values for the parameters, and that the width of this interval tells us the precision of our estimate. But remember how the CI was computed: using p values. We know that nonsignificant high p values do not rule in parameter values as plausible; rather, the values outside the interval have been ruled out, due to the fact that if those were the true values, the observed data would be surprising.

Firstly, the precision of the interval is plus or minus (0.523 - 0.108)/2 = 0.2075. That’s the half-width of the interval, and the width of an interval is how precision is measured. (The ‘goodness’ vs ‘badness’ of this number is a case-specific and potentially very complicated discussion. When they discuss virtue, I tend to agree with them.)
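Spelled out, in case you want to check my arithmetic:

```python
# The point estimate and half-width ("plus or minus") of the interval [0.108, 0.523]:
lower, upper = 0.108, 0.523
print((upper + lower) / 2)  # 0.3155 -> the point estimate sits in the middle
print((upper - lower) / 2)  # 0.2075 -> the "plus or minus"
```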

How was the CI computed? Using probability. In this case, the theory states (and I believe) that if our sample has been randomly taken from the population, and our confidence interval is wrong, then there’s a 10% chance that we would get the data that we got. A ‘10% chance of being a misleading sample’ is a scary thing to say, while a ‘90% chance of being a confidence-building sample’ is an optimistic thing to say. So we talk about a ‘90% confidence interval’ instead of a ‘10% we got unlucky interval’.
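If you want to see what that 90% is actually promising, here’s a sketch of the standard coverage story. The normally-distributed population, its parameters, and the known standard deviation are all assumptions I’ve made so the simulation is self-contained:

```python
# Repeat the whole sampling-and-interval procedure many times and count how
# often the resulting 90% interval contains the true mean.
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma, n = 0.3, 0.5, 30   # assumed population and sample size
z90 = 1.645                          # two-sided 90% critical value (known sigma)
half_width = z90 * sigma / np.sqrt(n)

trials, covered = 10_000, 0
for _ in range(trials):
    sample_mean = rng.normal(true_mean, sigma, size=n).mean()
    covered += (sample_mean - half_width <= true_mean <= sample_mean + half_width)

print(f"Coverage over {trials} repeats: {covered / trials:.3f}")  # roughly 0.90
```

Roughly 10% of the intervals miss. Any individual interval either contains the true mean or it doesn’t; the 90% describes how often the procedure works.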

Based on our 90% CI, we agree that we can say “It is implausible to say that the true average value is less than 0.108, or greater than 0.523.” In other words “If you say that the true average is less than 0.108, there’s a 95% chance that you’re wrong. If you say that it’s greater than 0.523, there’s a 95% chance that you’re wrong. If you say ‘it could be either of those things’, there’s a 90% chance that you’re wrong.”

It doesn’t prove anything. There is no such thing as ‘statistical proof’. Statistics allow us to make statements about “What the evidence supports.” The confidence interval tells us which numbers are supported by the evidence, and the p-value tells us how strongly the evidence supports the confidence interval.

Which brings me back to the word ‘plausible’. Apparently, ruling some values out as implausible does not rule other values in as plausible. So in the example, there’s a 95% chance that you’d be wrong if you said that the true average is less than 0.108. Most of us wouldn’t take those odds. If someone knowingly accepted those odds, we’d probably question their judgement. Numbers less than 0.108 are not really believable — they’re not plausible. (They might be true! Statistics is not proof. But if they’re true, then we can forgive ourselves for being misled by the evidence.)

But what about numbers between 0.108 and 0.523? Are they plausible? Let’s note that believability (and therefore plausibility) is not a binary property. If someone lies to us, and we believe them, people don’t judge us based on whether our belief was true or not. They judge us based on whether they would have believed the same lie — whether the lie was plausible. Plausibility is a matter of degree, not a matter of ‘ruled in’ versus ‘ruled out’.

We can definitely say “the values between 0.108 and 0.523 are more plausible than the values outside that range.” Indeed, the most likely values seem to be between 0.2 and 0.4. But as we shrink that range (as we increase the precision of our estimate), the p-value (the chance that we have been misled) increases. The wider the range of values that you select, the less likely it is that you’ll be wrong.
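Here’s a sketch of that trade-off, using the interval from the example. The point estimate and standard error are assumptions I’ve reverse-engineered so that the 90% interval comes out close to [0.108, 0.523]:

```python
# The same data summarised at three confidence levels: narrower intervals
# come with a higher chance of being misled, wider intervals with a lower one.
mean, std_err = 0.3155, 0.1262                   # assumed point estimate and standard error
z = {"50%": 0.674, "90%": 1.645, "99%": 2.576}   # normal critical values

for level, crit in z.items():
    half = crit * std_err
    print(f"{level} interval: [{mean - half:.3f}, {mean + half:.3f}] "
          f"(width {2 * half:.3f})")
```

The 50% interval is narrower (more precise), but you’d be wrong about it far more often; the 99% interval is very safe and not very informative.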

We may even ‘exclude the null hypothesis’ as they say — which means that we’re going to assume that the idea ‘this evidence doesn’t indicate anything’ is false and work with a different assumption. (For instance, that ‘the true average is between 0.108 and 0.523.’) If our data collection and analysis are good, then we can say that reasoning based on that assumption is ‘supported by the evidence.’
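Here’s a sketch of the duality that makes this work: a candidate value sits outside the 90% interval exactly when a test of ‘the true average is that value’ gives a p-value below 0.10. Same assumed summary statistics as in the previous sketch:

```python
import math

mean, std_err = 0.3155, 0.1262   # assumed summary statistics (illustrative)

def p_value(mu0):
    """Two-sided p-value for the hypothesis that the true average is mu0."""
    z = abs(mean - mu0) / std_err
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

for mu0 in (0.05, 0.2, 0.3, 0.4, 0.6):
    verdict = "outside the 90% CI" if p_value(mu0) < 0.10 else "inside"
    print(f"mu0 = {mu0}: p = {p_value(mu0):.3f} ({verdict})")
```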

A useful statistical estimate has a narrow enough confidence interval to be meaningful, and a low enough p-value that you’ll be forgiven if you’re wrong.

But there’s more to it than that, because there are different ways of calculating confidence intervals and p-values. You might not care what the average value actually is; you might simply want to know whether the average value of X is greater than the average value of Y. (The precise chance of me dying by suicide isn’t something I want to know. The fact that my chances are higher if I keep a gun in my home is enough to help me make the decisions I need to make.) If it’s only the comparison that matters, then there are statistical tools for comparison that you can use. If you have knowledge about the population you are sampling, you can feed those parameters into a ‘parametric’ method, and you’ll get better-quality results. The downside is that parametric methods are harder to learn than nonparametric methods, so you end up having to learn more stuff.
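For instance, here’s a sketch of two comparison-flavoured tools side by side, one parametric and one nonparametric, run on samples I’ve invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.35, 0.1, size=25)   # assumed sample from population X
y = rng.normal(0.30, 0.1, size=25)   # assumed sample from population Y

# Parametric comparison (assumes roughly normal populations):
t_stat, t_p = stats.ttest_ind(x, y)

# Nonparametric comparison (no normality assumption, typically a bit less powerful):
u_stat, u_p = stats.mannwhitneyu(x, y, alternative="two-sided")

print(f"t-test p = {t_p:.3f}, Mann-Whitney p = {u_p:.3f}")
```

Neither test tells you what the averages actually are; they tell you how strongly the evidence supports saying that X and Y differ.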

The answer to fallacious interpretations of p values is not to move to confidence intervals; confidence intervals only encourage related fallacies, which one can find in any confidence interval advocacy paper. If we wish to rid people of fallacies involving p values, more p values are needed, not fewer. Confidence intervals are not “better” than p values. The only way to interpret CIs reasonably is in terms of p values, and considering entire p value curves enables us to jettison the reliance on an arbitrary confidence coefficient, and helps us avoid fallacies.

There’s lots to agree with here. Plus some trolling of ‘confidence interval advocacy papers’, whatever they might be. Understanding p value curves is indeed a very useful thing. Understanding the terminology is also a very useful thing. Accuracy is not precision. Plausibility is not truth.

Written by

Nick Argall is an organization engineer, structuring activities to help businesses achieve their goals. nargall@gmail.com
