# The Counterexample

Larry Wasserman presents the following purported counterexample to the Likelihood Principle in his lecture notes for a course on theoretical statistics.

**Here’s a concrete illustration of what Wasserman has in mind.** Suppose we want to know the average height of adult male Americans. Because we can’t possibly measure every adult male American, we take a random sample from the population. Suppose (unrealistically) that we are able to carry out the simplest kind of random sampling scheme, in which each individual has an equal probability $\pi$ of being selected. There is a perfectly good way to estimate the average height of adult male Americans from the resulting sample: estimate it as being equal to the average height in the sample. But this procedure is not based on the likelihood function. Because the heights in the population are fixed, the only way chance comes into the data-generating mechanism is in the sampling process. Thus, the only parameter in the likelihood function is the sampling probability $\pi$. The population mean does not appear in the likelihood function at all.
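A small simulation makes the setup concrete. The population below is synthetic (a mean of 69 inches and standard deviation of 3 inches are illustrative assumptions, not figures from Wasserman); the point is just that the heights are fixed and the only randomness is in which individuals get selected:

```python
import random

random.seed(0)

# Synthetic stand-in for the fixed population of heights, in inches.
# The mean of 69 and SD of 3 are illustrative assumptions.
population = [random.gauss(69.0, 3.0) for _ in range(100_000)]
pop_mean = sum(population) / len(population)

# Simple random sampling: each individual is selected independently
# with the same probability pi, so the only randomness in the data
# comes from the sampling mechanism itself.
pi = 0.01
sample = [h for h in population if random.random() < pi]

# Estimate the fixed population mean by the sample mean.
sample_mean = sum(sample) / len(sample)
print(f"population mean: {pop_mean:.2f}, sample mean: {sample_mean:.2f}")
```

The sample mean tracks the population mean closely even though the procedure that produced it never mentions a likelihood function.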

Thus, according to Wasserman, survey sampling provides counterexamples to the Likelihood Principle: **the information about a population mean contained in a simple random sample is not contained in the likelihood function for that sample.**

# My Response

I can see why this kind of case looks like a counterexample to the Likelihood Principle: there is a perfectly sensible procedure for estimating the population mean from the data without a likelihood function linking the two. That procedure is intuitively obvious and reasonable, and it has the frequentist virtue of consistency. For a practicing statistician, the lesson is that inference need not be based on the likelihood function.

However, **I do not accept this case as a genuine counterexample to the Likelihood Principle**, for two reasons:

- It’s not true that *the* likelihood function in this case does not depend on the population mean.
- There is no contradiction between the Likelihood Principle and the claim that one can have a sensible procedure for estimating a quantity from data without a likelihood function linking the two.

I will explain these statements in turn.

## It’s not true that *the* likelihood function does not depend on the population mean

The heights of adult American males at any given time are fixed, not random. Thus, Wasserman claims that the population mean height can play no role in the likelihood function in this example. That likelihood function (up to a constant of proportionality) is simply the probability that the sampling mechanism would select the heights that it did in fact select. Thus, it depends on the sampling probability $\pi$, but not on the population mean $\theta$. Yet we can use those heights to estimate $\theta$, contrary (supposedly) to the Likelihood Principle.

**I reject the claim that because the heights of adult American males are fixed, not random, they can play no role in the likelihood function.** The fixed/random distinction being invoked here does not correspond to a genuine ontological distinction. It does not map onto the distinction between events that are determined by the laws of nature and initial conditions and events that are “genuinely chancy”: we quite reasonably treat the sampling process as random regardless of whether it is genuinely chancy. The fixed/random distinction is not given by the way the world is; it is something we *impose* on the world by specifying a statistical model. Now, some models are sensible and useful while others are crazy. But the fact that some quantity is genuinely “fixed” out there in the world does not mean that we cannot usefully include it in our models. In fact, statisticians include such quantities in their models all the time.

The fact I’m driving toward is that **we could give a likelihood function for the data in Wasserman’s example that does depend on the unknown population mean we are trying to estimate.** For instance, we could give a model according to which the heights of adult American males are normally distributed with a standard deviation of 3 inches. We could then derive a likelihood function for the data that depends on the unknown population mean height and use that likelihood function in estimating that height, perhaps through maximum likelihood or Bayesian estimation.
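Here is a minimal sketch of that move, assuming the illustrative normal model with a known standard deviation of 3 inches (the sample below is simulated, and the grid search is just a simple way to maximize the likelihood). Once the model is in place, the likelihood depends on the unknown mean $\theta$, and maximizing it recovers the sample mean:

```python
import math
import random

random.seed(1)

# Simulated sample of heights; sigma = 3 inches is treated as known,
# following the illustrative model in the text.
sigma = 3.0
sample = [random.gauss(69.0, sigma) for _ in range(500)]

def log_likelihood(theta):
    """Log-likelihood of the sample under the model N(theta, sigma^2)."""
    n = len(sample)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - sum((x - theta) ** 2 for x in sample) / (2 * sigma ** 2))

# Maximize over a grid of candidate values for theta (60.00 to 75.00
# inches in steps of 0.01). Under this model the maximum-likelihood
# estimate of theta is the sample mean, up to the grid resolution.
mle = max((t / 100 for t in range(6000, 7501)), key=log_likelihood)
sample_mean = sum(sample) / len(sample)
print(f"MLE: {mle:.2f}, sample mean: {sample_mean:.2f}")
```

The same likelihood function could instead be combined with a prior for Bayesian estimation; either way, the population mean now appears in the likelihood function.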

One might object that a likelihood function produced in this way would not be objectively meaningful. The population mean is what it is. Talk about what the distribution of heights in the population would be if the population mean were different is meaningless, or at least ill-defined.

This objection may be right. But our task is to estimate the population mean. Thus, we’re committed to considering various possible hypotheses about the value of that quantity. Those hypotheses do not confer probabilities on the data until they are placed in the context of a model that gives rise to likelihood functions. Thus, it seems reasonable that we need such a model before we can talk about the evidential meaning of the data with respect to those hypotheses.

## There is no contradiction between the Likelihood Principle and the claim that one can have a sensible procedure for estimating a quantity from data without a likelihood function linking the two

At this point one might object that, again, we don’t need a model in order to estimate the population mean: use of the sample mean is perfectly sensible and (in the technical sense) consistent.

I have two responses to this objection. First, **using the sample mean as an estimator would not necessarily be sensible**. To take an extreme example, suppose one wanted (for some reason) to estimate the average mass of the multicellular organisms in a room containing an elephant with fleas. A small (relative to the number of fleas) simple random sample of the multicellular organisms in that room would very probably yield a sample mean much smaller than the population mean. Proper modelling would allow one to account for this fact.

More generally, using the sample mean as an estimator may not be a good idea in highly skewed populations.
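A toy simulation of the elephant-and-fleas room makes the point vivid (all masses are made-up order-of-magnitude figures): small simple random samples almost always miss the elephant, so the sample mean almost always falls far below the population mean.

```python
import random

random.seed(2)

# Hypothetical room: one elephant (~5,000,000 g) and 9,999 fleas
# (~0.0002 g each). These are illustrative order-of-magnitude values.
population = [5_000_000.0] + [0.0002] * 9_999
pop_mean = sum(population) / len(population)  # roughly 500 g

# Draw many small simple random samples and count how often the
# sample mean is less than half the population mean.
underestimates = 0
trials = 1_000
for _ in range(trials):
    sample = random.sample(population, 20)
    if sum(sample) / len(sample) < pop_mean / 2:
        underestimates += 1

print(f"fraction of severe underestimates: {underestimates / trials:.3f}")
```

A sample of 20 from this population of 10,000 excludes the elephant about 99.8% of the time, so nearly every sample mean is a severe underestimate.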

Second, **the Likelihood Principle is compatible with the claim that one can have a sensible procedure for estimating a quantity from data without a likelihood function linking the two**. There is no contradiction between (1) the claim that the Likelihood Principle is a true sufficient condition for evidential equivalence and (2) the claim that statistical methods should be chosen on the basis of their long-run operating characteristics. Moreover, one could say that using the sample mean as an estimate of average adult American male height is sensible even in the absence of a *particular* likelihood function because it would be sensible on *any* reasonable likelihood function, given our knowledge that the distribution of heights is not highly skewed.

# Conclusion

Wasserman’s example illustrates the point that sensible statistical practice does not always require specifying a particular likelihood function. However, it does not speak against the claim made by advocates of the Likelihood Principle that the evidential meaning of a datum with respect to a set of hypotheses depends only on the probabilities that those hypotheses ascribe to that datum.

Thanks to Adam Brodie and Dan Malinsky for bringing this example to my attention and discussing it with me.


Michael Lew says

This is a very nice treatment of the alleged counter-example. I find it very convincing.

It seems to me that Wasserman’s counter-example would never have been proposed if it was conventional to state the likelihood principle in the manner that Edwards uses. Given a statistical model, all of the evidence in the data relevant to the parameter of interest is in the relevant likelihood function. (A paraphrase because my copy of his book is in my office, not at my breakfast table!)

As you show, Wasserman lacked a statistical model and generated a likelihood function that is not the relevant one for his parameter of interest.

David Rohde says

Interesting. I find you convincing.

… but do you have anything to say about this more complex example: http://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/

Greg Gandenberger says

I haven’t thought about that one yet. I’ll add it to my to-do list. Thanks!

David Rohde says

I have wasted quite a bit of time trying to form an opinion about it, and could lose plenty more….

FWIW at this point in time I see it like this:

There is a lot of wisdom in both the writing of Wasserman and Sims.

I like to think of the problem in terms of why, when we use Monte Carlo methods, we violate the likelihood principle, i.e. we take into account the form of the proposal distribution. If you are familiar with Monte Carlo methods, the problem can be thought of as drawing from a proposal distribution using rejection sampling and then using importance sampling to estimate the marginal likelihood. (It is slightly different in the sense that the number of accepted proposals is random, not fixed.)

When Wasserman says things like “Bayes fails,” he means that the Bayesian methods have poor frequentist properties when the prior puts independent distributions on $\theta$ and $\pi$. It’s implied that the Bayesian answer is suspect because of this, and it seems Sims accepts this – I am less convinced.

I am also skeptical of Sims’s arguments for putting a joint prior on $\theta$ and $\pi$.

In one of Sims’s papers he states “Examples where likelihood-based inference inevitably leads to bad estimators that are clearly worse than estimators that cannot be derived from a likelihood-based approach are rare, possibly because they do not exist.” – As one of the central ideas of Bayesian statistics is _opposition to_ point estimation, I don’t think this is a necessary path to go down in order to fail to see this as a counterexample to Bayesian inference.

Back to the Monte Carlo analogy, this paper is relevant: mlg.eng.cam.ac.uk/zoubin/papers/RasGha03.pdf. It’s a very similar problem but with continuous support. In this case the authors argue in favour of ignoring the proposal distribution, i.e. $\pi$, and develop a method that seems to work well in modest dimensions. I do wonder if the theorem of Robins and Ritov means that this method must have poor frequentist properties (particularly in high dimensions) and what this means in practice though…

As I said, I am struggling with this myself so no one should defer their opinion to me. In fact I think I am principally deferring my opinion to Christian Robert as given (very briefly) in the last few sentences of this post: https://xianblog.wordpress.com/2013/01/17/robbins-and-wasserman/

Greg Gandenberger says

Thanks, David! Looking into this example is not my top priority at the moment, but it will be at some point down the line. I’ll come back to this comment at that time.

David Rohde says

That is super understandable, and congratulations!