Testing our scientific tests

We recently reviewed the abstract and slides of Deborah Mayo’s 3 Dec. 2014 presentation at Rutgers University. Her excellent talk was titled “Probing with severity: beyond Bayesian probabilism and frequentist performance.” (Both the abstract and the slides of her lecture are available here.) Below, we sum up what we consider to be the most valuable contribution of Dr Mayo’s thoughtful lecture and then offer a brief and friendly critique of her main point.

Dr Mayo’s lecture (and her work generally, we might add) is important to us for two reasons. She not only stresses the importance of testing one’s theories (i.e. she takes a Popperian approach to science); she also recognizes the importance of assessing the “severity” of those tests.

For our part, we agree with Dr Mayo that Popper’s approach is the correct one. It’s what distinguishes science from aesthetics, politics, or religion. In Slide #17 of her presentation, Dr Mayo quotes Sir Karl Popper at length: “Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory.” (Popper, 1994, p. 89.) Simply put, we are not really doing science unless we are able to critically “test” our theories and intuitions, that is, unless there is some reliable way of showing them to be wrong.

Next, Dr Mayo makes two important points regarding tests. (See Slide #18 of her Dec. 3 talk.) First, she notes that when we test a hypothesis or claim, our test must be a “severe” or “genuine” one. A weak or non-severe test is really no test at all: “Agreement between data x and [hypothesis] H fails to count in support of … H, if so good an agreement was (virtually) assured even if H is false—[this is] no test at all!” She goes on to say (emphasis in original): “Data x (from test T) are evidence for H only if H has passed a severe test with x (one with a reasonable capability of having detected flaws in H).”

Second, Dr Mayo notes that Popper himself “never gave [us] a way to characterize severity [or genuineness] adequately.” (Along these lines, Dr Mayo also shares with us the juicy tidbit that Popper once confessed to her in writing his regret for never having learned statistics.) We thus come to the main purpose of Dr Mayo’s lecture: to fill this huge gap in the Popperian approach. She argues that “the central role of probability in statistical inference is severity—its assessment and control” (see Slide #19), and the remainder of her presentation is devoted to showing how “error statistics” and frequentist methods (confidence levels, significance levels, etc.) may, but need not, provide good assessments of severity.

We have no quarrel with Dr Mayo’s argument. Our only (friendly) criticism is this: why can’t Bayesian methods be used to assess the severity of scientific tests? In other words, we agree with Dr Mayo about the need to assess the severity of a given test (i.e. the need to test our tests). We would only add that the Bayesian approach is one such way of making this assessment. (Check this out, for example.) In short, how far we update our Bayesian priors, up or down, depends on the “severity” or “genuineness” of the method being used to test a hypothesis or claim. The more severe or genuine a test is, the more we are justified in updating our priors in a certain direction …
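To make the point concrete, here is a minimal sketch of one way a Bayesian might read severity. It rests on our own simplifying assumption, not on Dr Mayo’s formal definition: we treat a test as “severe” when the hypothesis H would be unlikely to pass it if H were false. The function name and all of the probabilities below are hypothetical and chosen purely for illustration.

```python
def posterior_after_pass(prior, p_pass_if_true, p_pass_if_false):
    """Posterior probability of H after H passes a test (Bayes' theorem).

    prior            -- P(H) before the test
    p_pass_if_true   -- P(pass | H is true)
    p_pass_if_false  -- P(pass | H is false); low values mean a severe test
    """
    evidence = p_pass_if_true * prior + p_pass_if_false * (1.0 - prior)
    return p_pass_if_true * prior / evidence


prior = 0.5  # illustrative starting credence in H

# Weak test: H would pass almost as surely if it were false.
weak = posterior_after_pass(prior, p_pass_if_true=0.95, p_pass_if_false=0.95)

# Severe test: H would rarely pass if it were false.
severe = posterior_after_pass(prior, p_pass_if_true=0.95, p_pass_if_false=0.05)

print(f"posterior after weak test:   {weak:.2f}")
print(f"posterior after severe test: {severe:.2f}")
```

With these made-up numbers, passing the weak test leaves our credence in H where it started (0.50), which matches Dr Mayo’s remark that such agreement is “no test at all,” while passing the severe test raises it to 0.95.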