Proofs of the Likelihood Principle have convinced me that frequentist methods fail to respect evidential equivalence (or, better, that they fail to respect some strong intuitions that I and many others have about evidential equivalence). On the other hand, it’s not clear to me that the fact that frequentist methods fail to respect evidential equivalence is a strong argument against their use. There is an important strand of frequentist thinking according to which frequentist methods should not be interpreted epistemically and are justified solely by their long-run operating characteristics. I am sympathetic to this perspective because it seems to me that what ultimately matters is not whether our methods gratify our intuitions, but rather how well they help us achieve our epistemic and practical goals. Still, the fact that a method has good frequentist properties is not sufficient to ensure that it works well in a more general sense.
Take uniformly most powerful (UMP) tests, for instance. A UMP level α test of a given null hypothesis H0, for data drawn from one of a given set of sampling distributions, is the test that, among all tests that reject H0 with frequency at most α in the long run when H0 is true, has the highest probability of rejecting H0 under each of the simple hypotheses that make up the alternative hypothesis. The UMP property is attractive, but it is not sufficient for a good test.
Consider the following example. Let X be a random variable that can take values x1, x2, x3, and x4. We are interested in performing a level .05 test of the point null hypothesis H0 against the point alternative Ha, which assert the sampling distributions shown below.
In this case, four nonrandomized level .05 tests are possible: a test that fails to reject H0 regardless of the data, and three tests that reject H0 if and only if X=x1, if and only if X=x2, and if and only if X=x3, respectively. Among those tests, the one that rejects H0 if and only if X=x1 has the highest probability of rejecting H0 for every simple hypothesis that makes up the alternative hypothesis Ha, namely Ha itself. Thus, it is the UMP level .05 test.
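A claim like this can be checked by brute force: list every possible rejection region, keep those whose Type I error rate is at most .05, and pick the one with the highest power. The sketch below does that. The probabilities in it are hypothetical numbers that I chose only to match the facts stated in this post (the likelihood ratios of .8 for x1 and .98 for x4, and x2 and x3 being more probable under Ha), so they should not be read as the example's actual table. The names `P_H0`, `P_HA`, `size`, and `power` are likewise just illustrative.

```python
from itertools import combinations

# HYPOTHETICAL sampling distributions, assumed for illustration only.
# They satisfy Pr(x1|Ha)/Pr(x1|H0) = .8 and Pr(x4|Ha)/Pr(x4|H0) = .98,
# with x2 and x3 more probable under Ha than under H0.
P_H0 = {"x1": 0.05, "x2": 0.026, "x3": 0.025, "x4": 0.899}
P_HA = {"x1": 0.04, "x2": 0.0395, "x3": 0.03948, "x4": 0.88102}

ALPHA = 0.05
EPS = 1e-12  # tolerance for floating-point comparison against ALPHA


def rejection_regions(outcomes):
    """Yield every subset of outcomes; each subset defines one nonrandomized test."""
    for k in range(len(outcomes) + 1):
        yield from combinations(outcomes, k)


def size(region):
    """Probability of rejecting H0 when H0 is true (Type I error rate)."""
    return sum(P_H0[x] for x in region)


def power(region):
    """Probability of rejecting H0 when Ha is true."""
    return sum(P_HA[x] for x in region)


# Keep only the tests whose Type I error rate does not exceed ALPHA.
level_tests = [r for r in rejection_regions(sorted(P_H0)) if size(r) <= ALPHA + EPS]

# Ha is a single simple hypothesis, so the most powerful level-ALPHA test is also UMP.
most_powerful = max(level_tests, key=power)

for region in level_tests:
    print(region, round(size(region), 5), round(power(region), 5))
print("most powerful test rejects on:", most_powerful)
```

With these assumed numbers, exactly four rejection regions pass the level .05 filter (the empty region and the three single outcomes x1, x2, x3), and the region containing only x1 has the highest power among them.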
But this test seems clearly unsatisfactory. x1 is more probable under H0 than under Ha. In that sense, H0 accounts for x1 better than Ha does. It thus seems unreasonable to reject H0 in favor of Ha because of x1. Moreover, there are two data points (x2 and x3) that are each more probable under Ha than under H0, and thus would seem to speak against H0 and in favor of Ha more strongly than x1 does. In fact, x1 seems to be the worst data point on which to reject H0 in favor of Ha: x4 is less probable under Ha than under H0, but the ratio of Pr(x4|Ha) to Pr(x4|H0) is very close to one (.98) whereas the ratio of Pr(x1|Ha) to Pr(x1|H0) is only .8.
This example shows that while a UMP level α test might be in an important sense the best test with a given Type I error rate on a given problem, it may be clearly less reasonable than a UMP test at a different α level. In this case, the test that rejects only when X=x2, the test that rejects only when X=x3, and the test that rejects when X is either x2 or x3 all seem more reasonable than the test that rejects when X=x1.
Preferring the test that rejects when X=x1 over both the test that rejects when X=x2 and the test that rejects when X=x3 makes sense if the cost of a Type II error is much higher than that of a Type I error. Similarly, preferring the test that rejects when X=x1 over the test that rejects when X=x2 or x3 makes sense if the cost of a Type I error is much higher than that of a Type II error. But preferring the test that rejects when X=x1 over all three alternatives would be very strange, and indeed that pattern of preferences cannot maximize expected utility.
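The expected-utility point can be illustrated numerically. The sketch below scores each test by a simple expected loss (prior probability of H0 times the cost of a Type I error times α, plus prior probability of Ha times the cost of a Type II error times β) and scans a grid of priors and cost ratios, looking for any setting under which the x1 test beats all three rivals at once. The rivals are taken to be the tests built from the two outcomes that favor Ha, namely x2 and x3, separately and together. Everything here is an assumption made for illustration: the probabilities are hypothetical values chosen to fit the likelihood ratios stated in this post, and the loss function and the names `expected_loss` and `x1_beats_all` are my own. A grid search is a sanity check, not a proof.

```python
# HYPOTHETICAL sampling distributions, assumed for illustration only.
P_H0 = {"x1": 0.05, "x2": 0.026, "x3": 0.025, "x4": 0.899}
P_HA = {"x1": 0.04, "x2": 0.0395, "x3": 0.03948, "x4": 0.88102}


def error_rates(region):
    """Return (alpha, beta) for the test that rejects H0 exactly on `region`."""
    alpha = sum(P_H0[x] for x in region)        # Type I error rate
    beta = 1.0 - sum(P_HA[x] for x in region)   # Type II error rate
    return alpha, beta


def expected_loss(region, prior_h0, cost_type1, cost_type2):
    """Expected loss, assuming correct decisions cost nothing."""
    alpha, beta = error_rates(region)
    return prior_h0 * cost_type1 * alpha + (1.0 - prior_h0) * cost_type2 * beta


RIVALS = [("x2",), ("x3",), ("x2", "x3")]


def x1_beats_all(prior_h0, cost_type1, cost_type2):
    """True if the x1 test has strictly lower expected loss than every rival."""
    own = expected_loss(("x1",), prior_h0, cost_type1, cost_type2)
    return all(own < expected_loss(r, prior_h0, cost_type1, cost_type2) for r in RIVALS)


# Only the ratio of the two costs matters, so fix the Type II cost at 1
# and vary the Type I cost over twelve orders of magnitude.
priors = [i / 100 for i in range(1, 100)]
type1_costs = [10 ** (k / 4) for k in range(-24, 25)]

found = [(p, c) for p in priors for c in type1_costs if x1_beats_all(p, c, 1.0)]
print("settings where the x1 test beats all three rivals:", found)
```

With these assumed numbers the search comes back empty: making Type I errors cheap enough for the x1 test to beat the single-outcome tests also makes the combined test better still, and making Type I errors costly enough for the x1 test to beat the combined test makes the single-outcome tests better.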
The argument I have just given is not an argument against frequentist methods, because frequentists need not and typically do not assume that all UMP tests are sensible. According to standard accounts, a frequentist can choose how to trade off α against β by specifying indifference curves in the α-β plane, such that no point on a given curve is preferred to any other point on that curve, together with a preference ordering over the curves. Ranking (α,β) pairs by their expected utility yields indifference curves that are parallel straight lines, but a frequentist regards the expected utility of a test as typically ill-defined because it requires a probability distribution over the hypothesis space. Thus, there is room within frequentist theory for indifference curves that specify a preference for the test that rejects when X=x1. Usually, however, a frequentist would prefer to this test both the test that rejects when X=x2 or x3 and the tests that reject when X=x2 alone or X=x3 alone.
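To see where the parallel straight lines come from, here is a short derivation. The notation is my own and is only a sketch: π stands for an assumed prior probability of H0, and c_I and c_II stand for the costs of Type I and Type II errors, with correct decisions assumed to cost nothing.

```latex
\begin{align*}
\text{expected loss}(\alpha, \beta) &= \pi \, c_{I} \, \alpha + (1 - \pi) \, c_{II} \, \beta \\
\text{indifference curve at loss } k: \quad
\beta &= \frac{k}{(1 - \pi) \, c_{II}} - \frac{\pi \, c_{I}}{(1 - \pi) \, c_{II}} \, \alpha
\end{align*}
```

Every value of k gives a line with the same slope, so the curves are parallel, and lower k is better. The slope depends on π, which is exactly the quantity a frequentist typically declines to supply.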
Although the argument just given is not a serious objection to frequentist methods, it does raise questions about the claim that frequentist methods are justified by their long-run operating characteristics. According to any statistical school it is better to perform a UMP level α test than a non-UMP level α test, but UMP tests often do not exist and not all UMP tests are sensible.