Forecasting the forecasts

Note (1/4): This post has been significantly revised.

In our previous post, we used Bayesian reasoning to revise or “update” our forecast about the possible outcomes in Gamble v. United States, a case that was heard by the Supreme Court of the United States (SCOTUS) on 6 December 2018 and that as of today (31 December) is still pending. In particular, based on the high number of amicus briefs submitted to SCOTUS in this case, we concluded there is a 37% probability that SCOTUS will overrule the “separate sovereigns” doctrine when it announces its decision sometime in 2019.

But how can we tell whether our forecast is a good one or not? From a single forecast standing alone, we cannot. Either SCOTUS will or will not change the jurisprudential status quo when it decides Gamble. But when we are able to make many probabilistic forecasts (i.e. when we forecast the outcomes of a large number of cases), we can measure or score the accuracy of our forecasting methods using a scoring method that was first proposed by Glenn Wilson Brier, who was an early advocate of probability forecasting and the use of probability forecasts in decision making. (See Glenn W. Brier, “Verification of forecasts expressed in terms of probability,” Monthly Weather Review, Vol. 78, no. 1 (1950), pp. 1-3. For more biographical details about Brier’s life and forecasting work, see here.)

In its simplest formulation, Brier’s scoring method produces a “Brier score” as follows:

Brier Score = 1/N ∑ (f_x – o_x)²

where N is the total number of forecasts, f_x is the probability that was forecast, and o_x is the outcome of the binary event that was the subject of the forecast. Before proceeding, it is worth noting that the value for f_x must always be expressed in the range of 0 to 1 and that the value for o_x must be either 0 or 1: zero if the event does not happen and 1 if it does happen. (The subscript x is just a gentle reminder that the accuracy of a set of predictions will be unknown until the events being forecast take place or not.)
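For readers who prefer code to notation, the formula can be sketched in a few lines of Python. This is only a minimal illustration of the definition given here; the function name `brier_score` and its argument names are my own choices, not part of Brier’s paper.

```python
def brier_score(forecasts, outcomes):
    """Mean of (f_x - o_x)^2 over N forecasts of binary events.

    forecasts: probabilities f_x, each between 0 and 1
    outcomes:  results o_x, each 0 (did not happen) or 1 (happened)
    """
    if len(forecasts) == 0 or len(forecasts) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    for f, o in zip(forecasts, outcomes):
        if not 0.0 <= f <= 1.0:
            raise ValueError(f"forecast {f} is not a probability")
        if o not in (0, 1):
            raise ValueError(f"outcome {o} is not 0 or 1")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)
```

A perfect forecaster (always 1.0 for events that happen and 0.0 for those that do not) would get 0 from this function, and a forecaster who always says 0.5 would get 0.25.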

In plain English, the Brier score attempts to quantify the accuracy of a set of probabilistic predictions of binary events by measuring the mean squared error of those predictions. Using this simple formulation, the score will take on a value somewhere between 0 and 1, since each squared difference between a predicted probability f_x, which must be between 0 and 1, and the actual outcome o_x, which can take on values of only 0 or 1, must itself lie between 0 and 1, and so must the average of those squared differences. The lower the Brier score is for a set of predictions, the more accurate those predictions are as a whole. (In Brier’s original formulation of his scoring method, the range is from 0 to 2. Tetlock and Gardner offer a good explanation of the logic of Brier’s scoring method on pp. 59-66 of their “superforecasting” book.)
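Why would the original formulation range from 0 to 2? As I understand it, Brier’s 1950 version sums the squared error over every possible category of the event, which for a binary event means both “it happens” and “it does not happen.” The sketch below assumes that two-category reading; the function name `brier_score_original` is hypothetical and chosen only for this illustration.

```python
def brier_score_original(forecasts, outcomes):
    """Two-category score: squared error is summed over both
    'event happens' and 'event does not happen' before averaging."""
    total = 0.0
    for f, o in zip(forecasts, outcomes):
        error_happens = (f - o) ** 2                 # category: x happens
        error_not_happens = ((1 - f) - (1 - o)) ** 2  # category: x does not happen
        total += error_happens + error_not_happens
    return total / len(forecasts)
```

Because the two error terms are always equal for a binary event, this version is simply twice the simple score, so the worst possible forecast scores 2 instead of 1.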

We can illustrate this ingenious scoring method by returning to our initial example, the aptly-named case of Gamble v. United States. Taking into account the number of amicus briefs that were submitted in this case, we reasoned there is a 37% probability (f_x = 0.37) that SCOTUS will overrule the “separate sovereigns” doctrine when it decides Gamble. If this unlikely event occurs, o_x takes the value 1; if it does not occur, o_x is 0. Given this prediction, my Brier score for this forecast can be calculated as follows: (1) if SCOTUS does decide to overrule itself in Gamble, my Brier Score is going to be (0.37 – 1)² = (– 0.63)² = 0.3969; (2) if, however, SCOTUS decides not to overrule the precedent, my Brier Score is (0.37 – 0)² = (0.37)² = 0.1369. [*]
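The arithmetic for the two possible outcomes in Gamble can be checked with a short Python snippet. The variable names are my own, and the rounding to four decimals is only there to hide floating-point noise.

```python
f_x = 0.37  # forecast probability that SCOTUS overrules the doctrine

# Single-forecast Brier score under each possible outcome
score_if_overruled = (f_x - 1) ** 2  # o_x = 1
score_if_retained = (f_x - 0) ** 2   # o_x = 0

print(round(score_if_overruled, 4))  # 0.3969
print(round(score_if_retained, 4))   # 0.1369
```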

Notice that my Brier score is lower (that is, better) when SCOTUS does not change the jurisprudential status quo, i.e. when it does not overrule its precedents. Why? Because saying there is a 37% probability that an event x will happen (e.g. a change in the status quo) is equivalent to saying there is a 63% probability that x will not happen (e.g. the status quo will remain), so I should receive a better score, one closer to zero, if x does not happen. That said, a single forecast is not enough to measure my forecasting acumen. To truly measure the forecasting accuracy of my Bayesian methods (and of my use of amicus briefs as a proxy for whether the status quo will change or not), we must make a large number of forecasts. For instance, if my Bayesian prediction model is a good one, SCOTUS will change the status quo close to 37% of the time across all the cases in which my model assigns a 37% probability to such a change. We will thus assemble a larger database of cases with which to test our model, and we will report our results in a future post. In the meantime, Happy New Year!

Source: David Lowe (Scrum & Kanban)

[*] What if I had stuck to my initial base rate and predicted a 13% probability (f_x = 0.13) that SCOTUS will overrule the “separate sovereigns” doctrine in Gamble? In that case (pun intended), if SCOTUS were to overrule itself, my Brier Score would be (0.13 – 1)² = (– 0.87)² = 0.7569. (In the alternative, if SCOTUS were to retain the status quo, my Brier Score would be (0.13 – 0)² = (0.13)² = 0.0169.) In other words, I could play it “safe” and stick with my base rate, but a prediction based on my base rate alone scores worse than a prediction based on my Bayesian updating procedure if SCOTUS does overrule itself, since 0.3969 is closer to zero than 0.7569. (If SCOTUS retains the status quo, the ranking flips: the base-rate score of 0.0169 is closer to zero than 0.1369.)
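To see the base-rate forecast and the updated forecast side by side, here is a small Python sketch that computes all four single-forecast scores. The helper name `single_forecast_score` and the dictionary layout are assumptions made for this illustration only.

```python
def single_forecast_score(f, o):
    """Brier score for one forecast f of one binary outcome o."""
    return (f - o) ** 2

forecasts = {"base rate": 0.13, "Bayesian update": 0.37}

# scores[label] = (score if SCOTUS overrules, score if it retains the status quo)
scores = {
    label: (
        round(single_forecast_score(f, 1), 4),
        round(single_forecast_score(f, 0), 4),
    )
    for label, f in forecasts.items()
}

for label, (overruled, retained) in scores.items():
    print(f"{label}: overruled={overruled}, retained={retained}")
```

The output shows the trade-off: the base rate does better if the status quo survives, while the updated forecast does better if SCOTUS overrules itself.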