Lies, damned lies, and … statistics

Note: this is the fourth part of our review of the paper “Bayesian reasoning in science” by Colin Howson and Peter Urbach. (The fifth and final installment of our review shall appear on 28 June.)

Let us return to Howson and Urbach’s Bayesian paper today. After presenting Bayes’s rule and the Bayesian approach to truth on pp. 371-372 of their paper (see our Bayesian blog posts of 25-26 June for reference), Howson and Urbach concede that the Bayesian approach to truth “has been widely criticized because it is based on personal, hence subjective, probabilities [cf. the problem of priors we talked about in our post of 26 June titled “Beliefs are like gambles”]. Scientific inference, critics say, should be perfectly objective.” Howson and Urbach thus spend the rest of their paper comparing and contrasting the Bayesian approach to truth with its leading challenger, what they refer to as the “classical statistical inference” model, an alternative approach to truth associated with the work of such giants as R. A. Fisher, Jerzy Neyman, and Egon Pearson (all of whom Howson & Urbach lump together as “classical statisticians”).

In brief, Howson and Urbach begin the second part of their paper by noting that the “classical” or non-Bayesian approach to truth “has two principal parts, the first relating to the testing of hypothesis (using significance tests) and the second to estimating the values of unknown parameters.” (In this post, we shall focus on Howson and Urbach’s critique of Fisherian hypothesis testing and the related idea of “significance”.) The authors then take a simple example to illustrate the Fisherian approach: an experimenter tossing a coin 20 times and counting the number of times the coin lands “heads” in order to test whether the coin is fair or not. “There are 21 possibilities,” they write, “ranging from no heads and 20 tails to 20 heads and no tails.” But how does the experimenter in this simple example know whether the coin is fair, i.e. how does he actually “test” his hypothesis in this case? If he is a Fisherian, he must perform a secondary “significance test”; that is, he must now proceed to “test” his results from the 20 previous coin tosses (though not the coin itself).
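For readers who like to see the arithmetic, here is a minimal sketch of what such a significance test might look like for the 20-toss example. The two-sided test, the 0.05 threshold, and the function names are our own illustrative assumptions; Howson and Urbach do not spell out these particular choices in the passage quoted above.

```python
from math import comb

N_TOSSES = 20  # the experimenter's fixed plan: toss the coin 20 times


def null_probability(heads, n=N_TOSSES):
    """Probability of exactly `heads` heads in n tosses if the coin is fair."""
    return comb(n, heads) / 2**n


def two_sided_p_value(heads, n=N_TOSSES):
    """Total probability, under the fair-coin hypothesis, of every outcome
    that is no more probable than the one observed (an assumed convention)."""
    observed = comb(n, heads)
    return sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed) / 2**n


# The 21 possible outcomes: 0 heads through 20 heads.
outcomes = list(range(N_TOSSES + 1))

# Outcomes that would be declared "significant" at the (assumed) 0.05 level.
rejection_region = [k for k in outcomes if two_sided_p_value(k) <= 0.05]

if __name__ == "__main__":
    print("number of possible outcomes:", len(outcomes))
    print("rejection region at 0.05:", rejection_region)
    print("p-value for 6 heads:", round(two_sided_p_value(6), 4))
```

Note that the calculation only ever refers to the fair-coin hypothesis and the list of outcomes that could have occurred under the 20-toss plan.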

You will find the splendid details of Howson and Urbach’s critique of significance testing on pp. 372-373 of their paper, but their main point, as we understand it, is this: whether the experimenter’s coin-toss results in the example above are “significant” in a statistical sense at some predetermined level (such as 0.05) tells us nothing about the actual coin being tested! Why? Because a significance test is not a direct test of truth; it is simply a secondary or subsidiary test of one’s experimental data. (By way of analogy, consider the difference between a historical or legal investigation into the actual contents of a document and an investigation of the way in which that document was made.) There is thus no necessary or logical relation between the “significance” of a given statistical test and the truth of the hypothesis being tested.

Worse yet, Howson and Urbach note that significance results are easy to manipulate and are highly sensitive to experimental design. In particular, on p. 373 of their paper they present an additional critique of significance testing, the stopping-rule problem:

In our earlier example, it was assumed … that because the coin was tossed 20 times, all of the possible outcomes would exhibit [some combination of] 20 heads and/or tails. But these are the possible outcomes only if the experimenter has a premeditated plan to throw the coin 20 times. Had the plan been to stop the experiment when, say six heads appeared, he could have got just the result he did, but with a different list of unrealized, possible outcomes.

So what? Here’s what:

Because significance is calculated by reference to these [unrealized, possible] outcomes, a result could be significant if the experimenter had had one plan (or stopping rule in mind), but not significant if it was another.
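To make the stopping-rule problem concrete, here is a rough sketch comparing the two plans. We assume, purely for illustration, that the observed result was 6 heads in 20 tosses, that the test is one-sided (too few heads), and that the threshold is 0.05; the function names are ours, not the authors’.

```python
from math import comb


def p_value_fixed_tosses(heads=6, tosses=20):
    """Plan A: toss exactly `tosses` times.
    'As or more extreme' means `heads` or fewer heads in `tosses` tosses."""
    return sum(comb(tosses, k) for k in range(heads + 1)) / 2**tosses


def p_value_stop_at_heads(heads=6, tosses=20):
    """Plan B: keep tossing until the `heads`-th head appears.
    'As or more extreme' means needing `tosses` or more tosses, i.e. seeing
    at most heads - 1 heads in the first tosses - 1 tosses."""
    return sum(comb(tosses - 1, k) for k in range(heads)) / 2 ** (tosses - 1)


p_plan_a = p_value_fixed_tosses()
p_plan_b = p_value_stop_at_heads()

if __name__ == "__main__":
    print("plan A (fixed 20 tosses):  p =", round(p_plan_a, 4))
    print("plan B (stop at 6th head): p =", round(p_plan_b, 4))
```

Under these assumptions the very same sequence of tosses yields a p-value of roughly 0.058 under the first plan and roughly 0.032 under the second, so it falls on opposite sides of the 0.05 line depending only on what the experimenter intended to do.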

In short, in the eloquent words of Howson and Urbach: “This dependence of significance tests … on the subjective, possibly unconscious intentions of the experimenter is an astonishing thing to discover at the heart of supposedly objective methodologies. It is also a most inappropriate thing to find in any methodology, for the plausibility, or cognitive value, of a hypothesis … should not depend on the experimenter’s mind.” (Ouch!)

But hold on a minute … what about the problem of subjective priors (which we noted in our post “Beliefs are like gambles” below)? Does the subjective Bayesian approach to truth fare any better than standard Fisherian methods? Stay tuned …