Larry Wasserman presents the following purported counterexample to the Likelihood Principle in his lecture notes for a course on theoretical statistics.
Here’s a concrete illustration of what Wasserman has in mind. Suppose we want to know the average height of adult male Americans. Because we can’t possibly measure every adult male American, we take a random sample from the population. Suppose (unrealistically) that we are able to carry out the simplest kind of random sampling scheme, in which each individual has an equal probability $\pi$ of being selected. There is a perfectly good way to estimate the average height of adult male Americans from the resulting sample: estimate it as being equal to the average height in the sample. But this procedure is not based on the likelihood function. Because the heights in the population are fixed, the only way chance comes into the data-generating mechanism is in the sampling process. Thus, the only parameter in the likelihood function is the sampling probability $\pi$. The population mean does not appear in the likelihood function at all.
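To make the setup concrete, here is a minimal Python sketch of the sampling scheme just described. All numbers (population size, mean, standard deviation, sampling probability) are illustrative inventions, not real survey data:

```python
import random

random.seed(0)

# Hypothetical fixed population of adult male heights in inches.
# The numbers are purely illustrative, not real survey data.
population = [random.gauss(70, 3) for _ in range(100_000)]
pop_mean = sum(population) / len(population)

# Bernoulli sampling: each individual is independently selected with
# probability pi, the simplest scheme described above.
pi = 0.001
sample = [h for h in population if random.random() < pi]
sample_mean = sum(sample) / len(sample)

print(f"population mean: {pop_mean:.2f} in, "
      f"sample mean: {sample_mean:.2f} in (n = {len(sample)})")
```

The sample mean lands close to the population mean, which is the intuitively obvious estimation procedure Wasserman has in mind.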
Thus, according to Wasserman, survey sampling provides counterexamples to the Likelihood Principle: the information about a population mean contained in a simple random sample is not contained in the likelihood function for that sample.
I can see why this kind of case looks like a counterexample to the Likelihood Principle: there is a perfectly sensible procedure for estimating the population mean from the data without a likelihood function linking the two. That procedure is intuitively obvious and reasonable, and it has the frequentist virtue of consistency. For a practicing statistician, the lesson is that inference need not be based on the likelihood function.
However, I do not accept this case as a genuine counterexample to the Likelihood Principle, for two reasons:
- It’s not true that the likelihood function in this case does not depend on the population mean.
- There is no contradiction between the Likelihood Principle and the claim that one can have a sensible procedure for estimating a quantity from data without a likelihood function linking the two.
I will explain these statements in turn.
It’s not true that the likelihood function does not depend on the population mean
The heights of adult American males at any given time are fixed, not random. Thus, Wasserman claims that the population mean height can play no role in the likelihood function in this example. That likelihood function (up to a constant of proportionality) is simply the probability that the sampling mechanism would select the heights that it did in fact select. Thus, it depends on the sampling probability $\pi$, but not on the population mean $\theta$. Yet we can use those heights to estimate $\theta$, contrary (supposedly) to the Likelihood Principle.
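Under the pure sampling model Wasserman describes, that likelihood can be written down explicitly: with Bernoulli selection, $L(\pi) \propto \pi^n (1-\pi)^{N-n}$, where $n$ is the sample size and $N$ the population size. A small sketch (numbers illustrative) makes the point that it depends only on $n$ and $N$, never on the heights themselves:

```python
import math

def sampling_log_likelihood(pi, n, N):
    """Log-likelihood of selecting exactly the observed n individuals
    from a fixed population of N under Bernoulli(pi) sampling.
    Note that the observed heights do not appear as arguments:
    L(pi) = pi**n * (1 - pi)**(N - n)."""
    return n * math.log(pi) + (N - n) * math.log(1 - pi)

# Any two samples of the same size yield the same function of pi,
# so the population mean theta never enters this likelihood.
print(sampling_log_likelihood(0.001, 100, 100_000))
```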
I reject the claim that because the heights of adult American males are fixed rather than random, they can play no role in the likelihood function. The fixed/random distinction being invoked here does not correspond to a genuine ontological distinction. It does not map onto the distinction between events that are determined by the laws of nature and initial conditions and events that are “genuinely chancy”: we quite reasonably treat the sampling process as random regardless of whether it is genuinely chancy. The fixed/random distinction is not given by the way the world is; it is something we impose on the world by specifying a statistical model. Now, some models are sensible and useful while others are crazy. But the fact that some quantity is genuinely “fixed” out there in the world does not mean that we cannot usefully include it in our models. In fact, statisticians include such quantities in their models all the time.
The fact I’m driving toward is that we could give a likelihood function for the data in Wasserman’s example that does depend on the unknown population mean we are trying to estimate. For instance, we could give a model according to which the heights of adult American males are normally distributed with a standard deviation of 3 inches. We could then derive a likelihood function for the data that depends on the unknown population mean height and use that likelihood function in estimating that height, perhaps through maximum likelihood or Bayesian estimation.
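A minimal sketch of that idea, under the assumed $N(\theta, 3^2)$ model (sample data simulated for illustration): the likelihood now depends on $\theta$, and maximizing it over a grid recovers the sample mean.

```python
import math
import random

random.seed(1)

# Hypothetical sample of heights in inches, for illustration only.
sample = [random.gauss(70, 3) for _ in range(50)]
sigma = 3.0  # assumed known standard deviation, as in the model above

def log_likelihood(theta):
    """Log-likelihood of the sample under the assumed N(theta, sigma^2) model."""
    return sum(-0.5 * ((x - theta) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi))
               for x in sample)

# Grid search over candidate values of theta from 60 to 80 inches.
grid = [60 + 0.01 * i for i in range(2001)]
mle = max(grid, key=log_likelihood)

print(f"MLE: {mle:.2f}, sample mean: {sum(sample) / len(sample):.2f}")
```

Under this model the maximum-likelihood estimate coincides (up to the grid resolution) with the sample mean, so the likelihood-based route and the intuitive route agree here.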
One might object that a likelihood function produced in this way would not be objectively meaningful. The population mean is what it is. Talk about what the distribution of heights in the population would be if the population mean were different is meaningless, or at least ill-defined.
This objection may be right. But our task is to estimate the population mean. Thus, we’re committed to considering various possible hypotheses about the value of that quantity. Those hypotheses do not confer probabilities on the data until they are placed in the context of a model that gives rise to likelihood functions. Thus, it seems reasonable that we need such a model before we can talk about the evidential meaning of the data with respect to those hypotheses.
There is no contradiction between the Likelihood Principle and the claim that one can have a sensible procedure for estimating a quantity from data without a likelihood function linking the two
At this point one might object that, again, we don’t need a model in order to estimate the population mean: use of the sample mean is perfectly sensible and (in the technical sense) consistent.
I have two responses to this objection. First, using the sample mean as an estimator would not necessarily be sensible. To take an extreme example, suppose one wanted (for some reason) to estimate the average mass of the multicellular organisms in a room containing an elephant with fleas. A small (relative to the number of fleas) simple random sample of the multicellular organisms in that room would very probably yield a sample mean much smaller than the population mean. Proper modelling would allow one to account for this fact.
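A quick simulation of this admittedly cartoonish scenario (all masses and counts invented for illustration) shows how badly the sample mean behaves here:

```python
import random

random.seed(2)

# Hypothetical room: one elephant plus many fleas, masses in grams.
population = [5_000_000.0] + [0.01] * 10_000
pop_mean = sum(population) / len(population)  # roughly 500 g

# How often does a simple random sample of size n miss the elephant,
# so that the sample mean is ~0.01 g instead of ~500 g?
n = 20
reps = 1_000
misses = sum(5_000_000.0 not in random.sample(population, n)
             for _ in range(reps))

print(f"population mean: {pop_mean:.1f} g")
print(f"fraction of samples missing the elephant: {misses / reps:.3f}")
```

Almost every small sample misses the elephant, so the sample mean is almost always orders of magnitude below the population mean, even though the estimator is unbiased.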
More generally, using the sample mean as an estimator may not be a good idea in highly skewed populations.
Second, the Likelihood Principle is compatible with the claim that one can have a sensible procedure for estimating a quantity from data without a likelihood function linking the two. There is no contradiction between (1) the claim that the Likelihood Principle is a true sufficient condition for evidential equivalence and (2) the claim that statistical methods should be chosen on the basis of their long-run operating characteristics. Moreover, one could say that using the sample mean as an estimate of average adult American male height is sensible even in the absence of a particular likelihood function because it would be sensible on any reasonable likelihood function, given our knowledge that the distribution of heights is not highly skewed.
Wasserman’s example illustrates the point that sensible statistical practice does not always require specifying a particular likelihood function. However, it does not speak against the claim made by advocates of the Likelihood Principle that the evidential meaning of a datum with respect to a set of hypotheses depends only on the probabilities that those hypotheses ascribe to that datum.
Thanks to Adam Brodie and Dan Malinsky for bringing this example to my attention and discussing it with me.