Saturday, November 19, 2005

The Church of the Null Hypothesis

All things being equal

The concept of null hypothesis is a stumbling block to many statistics students. Perhaps people resist the notion of null hypothesis because it is the opposite of what they expect. The null hypothesis merely states that nothing special is going on. One proves that something significant is occurring by rejecting the null hypothesis. That is, we demonstrate that something is happening by rejecting the hypothesis that nothing is occurring. Talk about bending over backward!

The null hypothesis deserves more respect than it gets. I believe most of us have a mindset that neglects the contingent nature of existence. This bias toward meaningfulness causes people like Carl Jung to confuse random coincidence with significant synchronicity. Instead of postulating some weird "alignment" of universal forces, why couldn't Jung accept that in a lifetime of occurrences some coincidences would be more startling and remarkable than others? We do, after all, reside in an environment loaded with circumstance. Every day we have thousands of thoughts, meet dozens (hundreds?) of people, see tens of thousands of images, read hundreds (thousands?) of words, and make innumerable choices. Such a combinatorial plethora is all but certain to generate (at thoroughly unpredictable intervals) the occasionally startling coincidence. The startling coincidence will have just "happened" and will have neither significance nor meaning.

Perhaps this point of view is disappointing or even unacceptable to those who prefer to cherish coincidences and endow them with Jungian significance. For those who can't let go of the idea that such occurrences have to mean something, we can ask what happened to the random outcomes. Do they not occur? Are there truly no coincidences? Or is there a moderate position that declares some occurrences are significant and some are not? In this case, how can we tell the difference? And if we can't, how meaningful can the difference be?

The null hypothesis is more than a simple statement that nothing is going on. Its role in statistics is to provide a neutral baseline from which alternative hypotheses are evaluated. For example, suppose you are responsible for testing the claim that a new drug is efficacious. Naturally, the null hypothesis would be that the new drug makes no difference. You conduct a series of drug trials and find that those patients who received the drug did slightly better than those who received a placebo. How do you decide that the improvement was large enough to be meaningful? You return to the null hypothesis, the claim that there is no effect, and calculate the probability that the improvement could have occurred purely by chance. If you find that the improvement could have occurred only 5% of the time by mere chance, you would be justified in saying that the drug is better than the placebo. (Statisticians refer to the 5% threshold as the level of significance. The choice of level of significance is a judgment call, although 5% and 1% are traditionally the most popular.)

The null hypothesis is not a belligerent option. It is a touchstone or standard against which rival claims are gauged. If an alternative does not show itself to be sufficiently remarkable relative to the null hypothesis, then the alternative is not deemed worthy of provisional acceptance. That is, we do not reject the null hypothesis (nothing is happening) for the alternative hypothesis (something is going on) if our experimental results could easily occur under the null hypothesis. In terms of our drug-testing example, why would you accept the purported efficacy of the drug if the observed improvement could have occurred 40% of the time by mere chance? Sure, a 40% chance is less than even odds in favor of nothing happening, but it is still much too high to warrant much faith in the treatment. Statisticians routinely set the significance bar at 5% or even 1% to ensure that we do not reject the null hypothesis too casually.

A pox on all houses

Unfortunately for statisticians, the public most often encounters measures of significance in political polls, surveys whose results are controversial and contentious by their very nature. Poll results are usually stated with the caveat that they have been computed with a 95% confidence level. In other words, there is only a 5% probability (there's that 5% again) that the results are wrong by more than a specified amount (the specified amount is usually given as plus-or-minus a certain number of percentage points, the number of points determined by the poll's sampling size). Stated another way, if the poll were repeated multiple times, the results would be seriously wrong one time out of twenty (on the average). Given that political polls are conducted frequently during the most heated contests, we run into the unhappy situation where bitter accusations of bias or incompetence (from whichever candidate is trailing in the polls) are aligned with just enough divergent results (every twentieth, on the average) to cause people to throw up their hands and declare that pols, polls, and pollsters are all reprehensible. (I have not even raised the point that some polls are indeed conducted by hirelings who skew the outcomes to favor the candidate who hired them. The point that politicians may be scoundrels hardly needs to be made, as examples are conveniently numerous these days.)

The other unfortunate factor in political polling, quite apart from biased pollsters and acrimonious debates concerning even responsible polling, is that polls are by their nature no more than snapshots. A poll that finds 45% of the voters in favor of Candidate A in October is not suddenly invalidated if Candidate A receives 51% of the vote in November. For all we know (and there are statistical measures to help us gauge how much we can reasonably know), exactly 45% of the voters were in favor of Candidate A in October. The candidate simply picked up another six percentage points of support between the poll and the election. Still, the consequence of such contrasts is that polls are routinely regarded as having been proved wrong after the fact. That is really too bad, because responsibly conducted polls (in most cases, polls not sponsored by a particular candidate or cause) provide useful information on the opinions of the electorate. As I said, they are snapshots, not predictions.

A prediction

Speaking of predictions, there is another venue in which the poor null hypothesis is routinely treated with abuse. The entire field of psychic research is particularly unfriendly toward the null hypothesis that nothing is going on. Psychic researchers have been reduced in recent decades to searching through their data for subtle signs that something might be happening, attempting to tease out some shred of significance in anything slightly out of the ordinary. This is a good point at which to recall that most statistical tests expect the null hypothesis to be incorrectly rejected about 5% of the time anyway, just by chance. Of such Type I errors entire careers have been constructed. The diligent psychic researcher, however, will find that the false positives will eventually settle down at the unmeaningful 5% level as he or she continues to investigate. A notable example is Dr. Susan Blackmore, who eventually abandoned research in parapsychology for the more fertile field of consciousness. (See in particular her short essay on giving up parapsychology.)

The most parsimonious explanation for the longterm and continuing failure of parapsychological research is that they are searching for something that is simply not there. Rare examples like Susan Blackmore notwithstanding, I confidently predict that psychic research is here to stay. Its devotees are too emotionally invested in the idea that coincidences, lucky guesses, and intuition are deeply significant representatives of profound and underdeveloped human powers. The null hypothesis is a more satisfactory explanation because it is simple and sensible. What it lacks, however, is allure and mystery, so the null hypothesis will continue to be rejected by those whom it fails to satisfy.

I am confident in my prediction, but I will not, however, claim that I am clairvoyant if it comes true.


Anonymous said...

Returning to the medical example, does the magnitude of the improvement figure in the analysis? I am imagining a situation in which 40% of people were cured.

As for coincidences, is it fair to say that any coincidence is bound to happen but that a particular coincidence is unlikely?

Zeno said...

It is perfectly fair to say that coincidence is bound to happen, but that any particular coincidence is highly unlikely. It would be quite a coincidence for me to run into the man who hired me for my faculty position nineteen years ago right after telling someone about how I got hired, but that happened last month. Of course, I didn't anticipate that or predict it in advance.

As for drug trials, I'm sure that outright cures in 40% of the cases would be judged a brilliant success no matter how conservatively the original experimental model was set up. I'm also sure that real-life researchers must take the magnitude of the purported effect into account in some systematic way, but that's not something I know anything about.

Nick Barrowman said...

In medical research, a distinction is made between statistical significance and clinical significance (or clinical relevance, as it is sometimes called). A study with a large enough sample size will always obtain a statistically significant estimate, but whether its magnitude is clinically significant is another matter (and in many ways a more slippery one).

This distinction has a connection with another point. The null hypothesis is not always the way you've described it. In the special case of equivalence testing, the roles of the null and alternative hypotheses are reversed. This makes good sense in certain situations. Suppose for example you have an established treatment and along comes a cheaper treatment that is anticipated to be just as good. Now you have to specify what you mean by "just as good" and that relates to the issue of clinical significance. This is explained quite nicely in this piece on statistical tests for equivalence.

Zeno said...

Thanks, Nick, for a very enlightening comment and a useful link. I appreciate instruction from those whose expertise exceeds my own (and their name is legion). Needless to say, I have no experience with clinical research.

Anonymous said...

In Innumeracy, a book that I will never stop talking about because it provides the most readable exposition of elementary statistics I've ever seen, gave a nice example of the distinction between statistical and clinical significance. I don't remember the exact numbers he used, but it was something like this: suppose that a test of a new headache medication showed, in study after study after study of thousands of participants, that it was shown to be effective in 5% of headaches, compared to the 2% that were helped by the placebo. That's a statistically significant result. But how much would you pay for this new drug?

Zeno said...

Thanks, MS. I should have remembered that. It's been a while since I've read Innumeracy. Perhaps I should dig it out again.

Thursday said...

Peter: *gasp* Brian! My Alpha Bits say "Ooooo!"

Brian: Peter, those are Cheerios.

-Family Guy

A willingness to believe in the exceptional can easily be confused with an unwillingness to accept the common. Most skeptics, I think, are willing to believe that something special could be happening, if there is enough rational evidence to warrant that belief; as opposed to mystics, who will reject that something could be mundane if there is the slightest possibility that a given event could be exceptional.

Moebius - There was a newspaper report on this stuff in Canada called "Cold F/X" and the shaky claims that their papers "support" - full of data mining, misuse of terms and small sample groups. Big seller, though, as one of the spokesfigures is a hockey broadcasting legend, so there's one answer to your question:
Who endorses it?

GreenSmile said...

"bias toward meaningfulness"...great idea. I grope around that idea all the time on my blog and never put the problem so succinctly. I think it is kind of the same as my explaination of why science comes off so weakly in the media when set opposite people who spout certainties.