Note: We recently concluded a five-part review of the main points in Howson & Urbach’s important paper “Bayesian reasoning in science” (see our various Bayesian blog posts from 25-28 June). We now wish to present our own thoughts in this postscript. Spoiler alert: we are going to apply the self-reference test to both frequentist and Bayesian methods!
Consider, first, by way of example, the “independent samples t-test.” (Frequentists have a huge toolkit of ad hoc statistical tools for evaluating the results of experiments, but we shall focus on the t-test since it’s one of the most common tests of statistical significance.)
Now, instead of tossing one coin 20 times (see our previous posts from 27 & 28 June), let’s say you toss two coins, coin A and coin B, 20 times each (for a grand total of 40 coin tosses). Further suppose that coin A produced 11 heads in 20 trials, while coin B only produced 8 heads … Are these results just random variation (i.e. noise, not signal), well within the range of what you would expect to find anytime you toss a fair coin 20 times, or are they “statistically significant,” i.e. too unusual to be attributed to chance alone?
In other words, we want to know whether the difference in results between the two experiments (i.e. the number of heads produced by each coin) is “statistically significant” or not, that is, whether the difference in results reflects a “real” difference in the type of coin used to generate our experimental data. “t-tests” (collecting independent samples and comparing them) and “statistical significance” are thus standard statistical tools for evaluating the results of experiments. But are they “science”?
We leave the ‘science’ question open, for now. The main problem we have with “t-tests” and standard statistical methods generally is their inability to pass the self-reference test (see, for example, our post from 14 May). For example, returning to our coin-toss example above, let’s say that you have finished conducting your (first-order) t-test experiment (e.g. tossing your coins and counting up the total number of heads generated by each coin) and that you have also finished evaluating the statistical significance of your test results. Now, shouldn’t you also conduct a second-order or higher-level experiment to measure the statistical significance of your statistically significant results?
This is not a frivolous or trivial question. In other words, the whole purpose of “statistical significance” is to tell us something important about our first-order data (e.g. the results of our coin-toss experiments), but our statistical analysis will, in turn, generate a new set of second-order data, such as sample size (e.g. the number of coin tosses), the size of the difference between the sample averages (e.g. the difference in the proportion of heads generated by each coin), and the standard deviations of the samples.
So, why can’t we test each one of these second-order data points for statistical significance? That is, why can’t we test the t-test itself?
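For readers who want to see the first-order test spelled out, here is a minimal sketch of an independent samples t-test on the coin data above (11 heads versus 8 heads, 20 tosses each). It assumes the equal-variance (pooled) version of the test, codes each toss as 1 for heads and 0 for tails, and uses a two-sided 5% threshold; the function name and the choice of threshold are ours, for illustration only. Note that it computes exactly the second-order quantities listed above: sample sizes, the difference between the sample averages, and the sample variances.

```python
import math


def pooled_t_statistic(heads_a, heads_b, n_a, n_b):
    """Return (t, df) for an equal-variance independent samples t-test
    on tosses coded as 1 (heads) or 0 (tails)."""
    mean_a = heads_a / n_a
    mean_b = heads_b / n_b
    # Sample variance of 0/1 outcomes, with the usual n - 1 correction.
    var_a = n_a / (n_a - 1) * mean_a * (1 - mean_a)
    var_b = n_b / (n_b - 1) * mean_b * (1 - mean_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Two-sided 5% critical value for 38 degrees of freedom, taken from a
# standard t table (an assumed threshold, not part of the original post).
T_CRITICAL_DF38 = 2.024

t, df = pooled_t_statistic(heads_a=11, heads_b=8, n_a=20, n_b=20)
significant = abs(t) > T_CRITICAL_DF38
print(f"t = {t:.2f}, df = {df}, significant at the 5% level: {significant}")
```

With these numbers the t statistic comes out at roughly 0.94, well below the threshold, so the test would report the difference between coin A and coin B as not statistically significant. Nothing in the procedure, however, says how to test that verdict itself.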
* * *
The Bayesian approach to truth, by contrast, is not only completely open to self-criticism; it is also able to pass the self-reference test with flying colors. Just follow these two steps:
First, you need to assign some subjective prior probability to the truth of Bayes’s rule itself. Note: it doesn’t matter what your priors are in this regard, since you might be highly skeptical of inverse probabilities or you might be a hardcore Bayesian through-and-through, so long as your priors are not completely dogmatic, i.e. 0 or 1. (For example, if you are not a Bayesian or if you simply distrust Bayesian methods, then assign a low value to this prior (a value less than 0.5 but greater than 0). If you are a Bayesian, then assign your prior a high value (a value greater than 0.5 but less than 1); or if you are a good Bayesian, assign a value of 0.5.)
Next, put Bayes’s rule to the test by using Bayesian methods to make predictions or to measure the truth of certain propositions and then “update” or revise your priors accordingly. If the Bayesian approach gives you good results, then … keep on using Bayesian methods. But if Bayesian methods fail to make good predictions or fail to bring you closer to the truth, you have effectively falsified the Bayesian approach … In that case, it’s time to look somewhere else for answers.
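The two steps above can be sketched in a few lines of code. This is only an illustration under assumptions of our own: we treat “Bayesian methods are reliable” as the hypothesis, we suppose a reliable method gets a prediction right 80% of the time and an unreliable one 50% of the time, and we invent a short track record of hits and misses. None of these numbers come from real data.

```python
def update(prior, p_hit_if_reliable, p_hit_if_unreliable, hit):
    """One application of Bayes's rule to the hypothesis
    'Bayesian methods are reliable', given one prediction outcome."""
    like_h = p_hit_if_reliable if hit else 1 - p_hit_if_reliable
    like_not_h = p_hit_if_unreliable if hit else 1 - p_hit_if_unreliable
    evidence = like_h * prior + like_not_h * (1 - prior)
    return like_h * prior / evidence


# Hypothetical likelihoods and a made-up track record (True = good prediction).
P_HIT_IF_RELIABLE = 0.8
P_HIT_IF_UNRELIABLE = 0.5
outcomes = [True, True, False, True, True, True, False, True]

posteriors = {}
for prior in (0.0, 0.1, 0.5, 0.9, 1.0):
    belief = prior
    for hit in outcomes:
        belief = update(belief, P_HIT_IF_RELIABLE, P_HIT_IF_UNRELIABLE, hit)
    posteriors[prior] = belief
    print(f"prior {prior:.1f} -> posterior {belief:.3f}")
```

In this sketch the skeptic (prior 0.1), the agnostic (prior 0.5), and the enthusiast (prior 0.9) all revise their beliefs in light of the track record, while the dogmatic priors of 0 and 1 never move, which is why the first step rules them out.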
But here’s the rub. That “something else” should in principle be open to self-criticism. It should be subjected to the self-reference test. It should be falsifiable. The Bayesian approach has the virtue of meeting these conditions. Are frequentists able to?
True or false?