Proofs of the Likelihood Principle have convinced me that frequentist methods fail to respect evidential equivalence—or, better, that they fail to respect some strong intuitions that I and many others have about evidential equivalence. On the other hand, it’s not clear to me that the fact that frequentist methods fail to respect evidential equivalence is a strong argument against their use. There is an important strand of frequentist thinking according to which frequentist methods should not be interpreted epistemically and are justified solely by their long-run operating characteristics. I am sympathetic to this perspective because it seems to me that what ultimately matters is not whether our methods gratify our intuitions, but rather how well they help us achieve our epistemic and practical goals. At the same time, the fact that a method has good frequentist properties is not sufficient to ensure that it works well in a more general sense.

Take uniformly most powerful (UMP) tests, for instance. A level α test of a given null hypothesis H_{0}, for data drawn from one of a given set of sampling distributions, is UMP if, among all tests that reject H_{0} with frequency at most α in the long run when H_{0} is true, it has the highest probability of rejecting H_{0} under every simple hypothesis that makes up the alternative hypothesis. The UMP property is attractive, but it is not sufficient for a good test.

Consider the following example. Let X be a random variable that can take the values x_{1}, x_{2}, x_{3}, and x_{4}. We are interested in performing a level .05 test of the point null hypothesis H_{0} against the point alternative H_{a}, each of which asserts one of the sampling distributions shown below.

In this case, four nonrandomized level .05 tests are possible: a test that fails to reject H_{0} regardless of the data, and tests that reject H_{0} if and only if X=x_{1}, if and only if X=x_{2}, and if and only if X=x_{3}. Among those tests, the one that rejects H_{0} if and only if X=x_{1} has the highest probability of rejecting H_{0} under every simple hypothesis that makes up the alternative hypothesis H_{a}, namely H_{a} itself. Thus, it is the UMP level .05 test.

But this test seems clearly unsatisfactory. x_{1} is *more* probable under H_{0} than under H_{a}. In that sense, H_{0} accounts for x_{1} better than H_{a} does. It thus seems unreasonable to reject H_{0} in favor of H_{a} because of x_{1}. Moreover, there are two data points (x_{2} and x_{3}) that are each more probable under H_{a} than under H_{0}, and thus would seem to speak against H_{0} and in favor of H_{a} more strongly than x_{1} does. In fact, x_{1} seems to be the *worst* data point on which to reject H_{0} in favor of H_{a}: x_{4} is less probable under H_{a} than under H_{0}, but the ratio of Pr(x_{4}|H_{a}) to Pr(x_{4}|H_{0}) is very close to one (.98), whereas the ratio of Pr(x_{1}|H_{a}) to Pr(x_{1}|H_{0}) is only .8.
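The situation can be checked mechanically. The post's own table is not reproduced here, so the probabilities below are an assumption, chosen only to satisfy the constraints stated in the text: likelihood ratio .8 at x_{1}, ratios above one at x_{2} and x_{3}, ratio .98 at x_{4}, and exactly four nonrandomized level .05 tests. A minimal sketch:

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical sampling distributions (an assumption for illustration;
# the post's own table is not shown in this excerpt). They satisfy:
#   Pr(x1|Ha)/Pr(x1|H0) = 0.8, LR(x2), LR(x3) > 1, LR(x4) = 0.98,
#   and only the empty region and the three singletons {x1},{x2},{x3}
#   give nonrandomized tests of level at most .05.
outcomes = ["x1", "x2", "x3", "x4"]
p0 = {"x1": F(5, 100), "x2": F(26, 1000), "x3": F(26, 1000), "x4": F(898, 1000)}
pa = {"x1": F(4, 100), "x2": F(3998, 100000), "x3": F(3998, 100000), "x4": F(88004, 100000)}
assert sum(p0.values()) == 1 and sum(pa.values()) == 1

# Likelihood ratios Pr(x|Ha)/Pr(x|H0): x1 has the smallest.
lr = {x: pa[x] / p0[x] for x in outcomes}
print({x: float(lr[x]) for x in outcomes})  # x1 -> 0.8, x4 -> 0.98

# Enumerate every nonrandomized rejection region of level .05.
regions = [set(c) for r in range(5) for c in combinations(outcomes, r)]
level_05 = [R for R in regions if sum(p0[x] for x in R) <= F(5, 100)]
print(sorted(map(sorted, level_05)))  # only [], [x1], [x2], [x3]

# The UMP test maximizes power Pr(reject | Ha) among level-.05 tests.
ump = max(level_05, key=lambda R: sum(pa[x] for x in R))
print(ump)  # {'x1'}: the UMP test rejects on the lowest-likelihood-ratio outcome
```

Under these assumed numbers the UMP level .05 test is indeed the one that rejects on x_{1}, even though x_{1} is the outcome that favors H_{a} least.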

This example shows that while a UMP level α test might be, in an important sense, the best test with a given Type I error rate on a given problem, it may be clearly less reasonable than a UMP test at a different α level. In this case, the tests that reject when X=x_{2} or when X=x_{3} and the test that rejects when X=x_{2} or x_{3} all seem more reasonable than the test that rejects when X=x_{1}.

Preferring the test that rejects when X=x_{1} over both the test that rejects when X=x_{2} and the test that rejects when X=x_{3} makes sense if the cost of a Type II error is much higher than that of a Type I error: among the level .05 tests, the test based on x_{1} has the most power. Similarly, preferring the test that rejects when X=x_{1} over the test that rejects when X=x_{2} or x_{3} makes sense if the cost of a Type I error is much higher than that of a Type II error, since that test has a Type I error rate above .05. But preferring the test that rejects when X=x_{1} over all three alternatives would be very strange, and indeed that pattern of preferences cannot maximize expected utility.
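The impossibility claim can be checked numerically. Expected loss is linear in the error rates: with prior probabilities π_{0} and π_{1} on the two hypotheses and costs c_{I} and c_{II} for the two error types (notation assumed here), a test with error rates (α, β) has expected loss π_{0}c_{I}α + π_{1}c_{II}β, and its ranking depends only on the ratio t = π_{0}c_{I}/(π_{1}c_{II}). The sketch below again uses hypothetical probabilities consistent with the constraints stated in the text (an assumption) and sweeps t to show that no prior/cost combination makes the x_{1} test the strict expected-loss minimizer among it, the tests based on the two outcomes that favor H_{a} (x_{2} and x_{3}), and their union:

```python
# Hypothetical probabilities consistent with the constraints in the text
# (an assumption for illustration; the post's own table is not shown).
p0 = {"x1": 0.05, "x2": 0.026, "x3": 0.026, "x4": 0.898}        # under H0
pa = {"x1": 0.04, "x2": 0.03998, "x3": 0.03998, "x4": 0.88004}  # under Ha

def error_rates(region):
    """Type I rate (alpha) and Type II rate (beta) of the test rejecting on `region`."""
    alpha = sum(p0[x] for x in region)
    beta = 1 - sum(pa[x] for x in region)
    return alpha, beta

tests = {"x1 only": {"x1"}, "x2 only": {"x2"}, "x3 only": {"x3"},
         "x2 or x3": {"x2", "x3"}}

# Expected loss = pi0*cI*alpha + pi1*cII*beta; up to a positive scale factor
# it equals t*alpha + beta with t = (pi0*cI)/(pi1*cII), so sweeping t covers
# every combination of priors and error costs.
def best_tests(t):
    loss = {name: t * error_rates(R)[0] + error_rates(R)[1]
            for name, R in tests.items()}
    m = min(loss.values())
    return {name for name, l in loss.items() if abs(l - m) < 1e-12}

# For no t is the x1 test the strict expected-loss minimizer among the four.
for t in [10 ** (k / 10) for k in range(-60, 61)]:  # t from 1e-6 to 1e6
    assert best_tests(t) != {"x1 only"}
```

When Type I errors dominate (large t), the low-α singleton tests win; when Type II errors dominate (small t), the higher-power union wins; the x_{1} test is never the unique optimum, matching the argument in the text.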

The argument I have just given is not an argument against frequentist methods, because frequentists need not and typically do not assume that all UMP tests are sensible. According to standard accounts, a frequentist can choose how to trade off the α level against the β level by specifying *indifference curves* in the α–β plane, such that any points on the same indifference curve are not preferred to one another, together with a preferential ordering of the curves. Ranking (α, β) pairs by their expected utility yields indifference curves that are parallel straight lines, but a frequentist typically regards the expected utility of a test as ill-defined because it requires a probability distribution over the hypothesis space. Thus, there is room within frequentist theory for indifference curves that specify a preference for the test that rejects when X=x_{1}. Usually, however, a frequentist would prefer to this test both the test that rejects when X=x_{2} or x_{3} and the tests that reject when either X=x_{2} or X=x_{3}.
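The claim that expected-utility ranking yields parallel straight-line indifference curves can be made explicit. A sketch, with π_{0}, π_{1} denoting the prior probabilities of the two hypotheses and c_{I}, c_{II} the costs of the two error types (notation assumed here):

```latex
% Expected loss of a test with error rates (alpha, beta):
\mathrm{EL}(\alpha, \beta) = \pi_0 c_{\mathrm{I}}\,\alpha + \pi_1 c_{\mathrm{II}}\,\beta .
% An indifference curve is a level set EL = k:
\beta = \frac{k}{\pi_1 c_{\mathrm{II}}} - \frac{\pi_0 c_{\mathrm{I}}}{\pi_1 c_{\mathrm{II}}}\,\alpha ,
% i.e., straight lines with common slope -(pi0 cI)/(pi1 cII),
% ordered by k, with lower expected loss preferred.
```

Because every curve has the same slope, the family is a set of parallel lines; without the prior probabilities π_{0} and π_{1}, the slope is undetermined, which is why a frequentist regards this ranking as unavailable.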

Although the argument just given is not a serious objection to frequentist methods, it does raise questions about the claim that frequentist methods are justified by their long-run operating characteristics. According to any statistical school, it is better to perform a UMP level α test than a non-UMP level α test, but UMP tests often do not exist, and not all UMP tests are sensible.
