One of the most damning phrases in the scientific literature is “Not statistically significant.” In a world where policy is announced by 140-character twits, the wordy “not statistically significant” readily becomes “not significant” and then “irrelevant.” But “not statistically significant” has a very specific meaning…and only rarely does it mean “irrelevant.”

Formally, we have the following problem: There is a mass of evidence, and two possible ways that evidence could have been produced, called the *null hypothesis* and the *alternate hypothesis*. Your mission, should you choose to accept it, is to decide which hypothesis is correct.

Unfortunately, that’s impossible. So statisticians take a different approach: If the evidence is *sufficiently unlikely* to be generated in the universe where the null hypothesis is true, we’ll *reject* the null hypothesis and say that the evidence is *statistically significant*. But if it is sufficiently likely that the evidence would be generated in a universe where the null hypothesis is true, then we’d say that the evidence is *not statistically significant* and reject the alternate hypothesis.

For example, consider a coin. There are two possibilities for the state of the universe: *Either* the coin is a fair coin, and will land heads 1/2 the time; *or* the coin is unfair. For somewhat technical reasons, “Coin is fair” is the null hypothesis.

Now let’s collect some evidence. Say we flip the coin 10 times and see it land heads 8 of those 10 times. Many people might conclude, based on the evidence, that the coin is *unfair*.

Herein lies the problem. By concluding the coin is *unfair*, we have established a guideline for future experiments: “If a coin lands heads 8 out of 10 times, conclude that the coin is unfair.” That works great if someone sees a coin land heads 8 times out of 10 flips. But what if they see it land heads *9* times out of 10 flips? It seems reasonable to also conclude the coin is unfair. Likewise a coin that lands heads 10 times in 10 flips. And if they see the coin land *tails* 8, 9, or 10 times out of 10, they might also conclude the coin is unfair. What this means is that if we make a decision based on the evidence, then any evidence that is *at least as compelling* should lead us to the same conclusion.

Now suppose you have 100 people testing coins. If every single one of them has a *fair* coin, then about 10 of them will see that coin land heads (or tails) 8 or more times out of 10! So about 10% of those testing *fair* coins will conclude they are *unfair* coins.
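The "about 10 in 100" figure can be checked directly by counting outcomes. The following is a minimal sketch, assuming a fair coin and independent flips; the function name `two_sided_tail` is just an illustrative choice, not a standard one.

```python
from math import comb

def two_sided_tail(n, k):
    """Probability that a fair coin shows at least k heads OR at least k tails
    in n flips. Assumes k > n/2, so the two cases cannot both happen."""
    # Count the ways to get k, k+1, ..., n heads out of n flips.
    one_side = sum(comb(n, i) for i in range(k, n + 1))
    # Tails is symmetric, so double the count; there are 2**n equally likely outcomes.
    return 2 * one_side / 2**n

p = two_sided_tail(10, 8)
print(f"{p:.4f}")  # 0.1094
```

So roughly 11 out of every 100 people testing a fair coin would see a result at least as lopsided as 8 out of 10.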

That 10% corresponds to what statisticians call the *level of significance*. What’s important to understand is that the level of significance is completely arbitrary: it’s based on how often you’re willing to make the wrong decision about the null hypothesis. The *lower* the level of significance, the *more compelling* the evidence must be before you reject the null hypothesis. And if the evidence isn’t sufficiently compelling, you declare the evidence to be “not statistically significant.”

In this case, at a 5% level of significance, you’d need to see the coin land heads (or tails) at least 9 times out of 10. With 8 heads in 10 flips, you’d say that the evidence for the coin being unfair is *not statistically significant*. And yet, most people would hold that this is compelling evidence that you’re dealing with an unfair coin.
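To see where the "at least 9 out of 10" threshold comes from, one can search for the least lopsided result whose probability under a fair coin is no larger than the chosen level of significance. This is only a sketch under the same fair-coin, two-sided assumptions as the example above; `critical_count` and `two_sided_tail` are hypothetical helper names.

```python
from math import comb

def two_sided_tail(n, k):
    """P(at least k heads or at least k tails) in n fair flips, for k > n/2."""
    one_side = sum(comb(n, i) for i in range(k, n + 1))
    return 2 * one_side / 2**n

def critical_count(n, alpha):
    """Smallest count k (more than half of n) such that seeing k or more heads,
    or k or more tails, has probability at most alpha for a fair coin.
    Returns None if even n out of n is not rare enough."""
    for k in range(n // 2 + 1, n + 1):
        if two_sided_tail(n, k) <= alpha:
            return k
    return None

print(critical_count(10, 0.05))   # 9
print(critical_count(10, 0.11))   # 8
print(critical_count(10, 0.001))  # None
```

Loosening the level of significance to 11% makes 8 out of 10 "statistically significant," while tightening it to 0.1% means no result from 10 flips could ever qualify. The evidence is the same in each case; only the arbitrary cutoff has moved.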

Here’s another way to look at it. In the movie *Dirty Harry* (1971), Clint Eastwood utters one of the most iconic lines in movie history. In case you’ve been living under a rock for the past 40 years, the setup is that Eastwood (a cop and the title character) is facing down a suspect after a chase. The assailant has a gun just within reach…but Eastwood has his gun drawn and pointed at the suspect. The problem is:

“Did he fire six shots or only five?” Well to tell you the truth in all this excitement I kinda lost track myself. But being this is a .44 Magnum, the most powerful handgun in the world and would blow your head clean off, you’ve gotta ask yourself one question: “Do I feel lucky?” Well, do ya, punk?

From the dialog, we can assume there are five confirmed shots. There are two possibilities: Either the gun is empty, or the gun has one more round in it.

Let the null hypothesis be “The gun has been emptied.” A statistically informed punk might reason thusly: “It is sufficiently likely that ‘five confirmed shots’ could be produced by a now-empty gun. *Therefore the evidence that the gun has one more round is not statistically significant.*” Consequently, they would reject the alternate hypothesis (that the gun has one more round). In practice, this means they would proceed as if the gun were empty.

(Alert readers will note that the argument works both ways, if we interchange the null and alternate hypotheses. True enough…but as I said, there are somewhat technical reasons for which hypothesis is the null hypothesis, and if this were my statistics course, I’d use this discussion as a lead-in to how we decide.)