In my previous post I presented some reasons to resist a clever counterexample to the Law of Likelihood developed by Mike Titelbaum. In that post I chose to stay at the level of intuitions about the example and about what kinds of features we might want a measure of evidential favoring to have. **In this post I go deeper by examining Mike’s example in light of the purpose of the Law of Likelihood.**

### Recap of the Example

The Law of Likelihood says that evidence $E$ favors hypothesis $H_1$ over hypothesis $H_2$ if and only if $\Pr(E|H_1)>\Pr(E|H_2)$.

Here is Mike’s counterexample, as he presented it to me in personal correspondence (shared with permission).

> We’re playing Hearts (with a standard deck). At the beginning of the game Branden [Fitelson, who was also involved in the discussion,] passes me one card face down. I hate scoring any points in Hearts. It turns out that the Two of Hearts hardly ever yields points for its bearer, so if I receive the Two of Hearts from Branden I’m mildly annoyed. However if I receive a different heart, or the Queen of Spades, I’m really pissed off.
>
> Now suppose you catch a glimpse of the card Branden passes me, and see only that it’s a heart. That’s your evidence; the two (mutually exclusive) hypotheses are that I’m mildly annoyed or that I’m really pissed off.

Mike claims that, intuitively, the fact ($E$) that the card is a heart favors the hypothesis ($H_1$) that he is really pissed off over the hypothesis ($H_2$) that he is mildly annoyed. However, **the Law of Likelihood says the opposite:** it says that $E$ favors $H_2$ over $H_1$ because $\Pr(E|H_2)=1>\Pr(E|H_1)=12/13$.
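The two likelihoods can be checked by brute enumeration. The sketch below (Python, with my own hypothetical card encoding) assumes, as the example implicitly does, that the passed card is uniformly distributed over the cards consistent with each hypothesis:

```python
from fractions import Fraction

# Card types: ranks 2-14 (14 = ace) in four suits.
deck = [(r, s) for r in range(2, 15)
        for s in ("hearts", "diamonds", "clubs", "spades")]

# H1 (really pissed off): any heart above the two, or the queen of spades.
h1 = [(r, s) for (r, s) in deck
      if (s == "hearts" and r > 2) or (r, s) == (12, "spades")]
# H2 (mildly annoyed): the two of hearts.
h2 = [(2, "hearts")]

def pr_heart_given(hypothesis):
    """Pr(card is a heart | hypothesis), assuming a uniform distribution
    over the cards consistent with the hypothesis."""
    hearts = [c for c in hypothesis if c[1] == "hearts"]
    return Fraction(len(hearts), len(hypothesis))

print(pr_heart_given(h1))  # 12/13
print(pr_heart_given(h2))  # 1
```

Since $\Pr(E|H_2)=1>\Pr(E|H_1)=12/13$, the Law of Likelihood delivers the verdict described above: $E$ favors $H_2$ over $H_1$.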

### Recap of My Previous Post

Mike’s example suggests that the Law of Likelihood can fail to capture our intuitions about evidential favoring: a strong majority (13/18, 72%) of respondents to a pair of (unscientific) polls said that the Law of Likelihood gives the wrong result in that example. Thus, to the extent that my polls and our intuitions about evidential favoring are reliable, Mike’s example shows that likelihoodist methods fail in their aim of characterizing data as evidence.

In my previous post, I gave reasons to think that these intuitions about evidential favoring might not be reliable in the case of Mike’s example. It seems plausible that intuitions about Mike’s example are unduly influenced by the fact that the posterior probability of $H_1$ is high while that of $H_2$ is low. Moreover, all of the general reasons to regard the Law of Likelihood as a good account of evidential favoring apply in Mike’s example. I for one find it not at all intuitively clear that the Law of Likelihood gives the wrong verdict in Mike’s example when I hold these considerations in mind.

### What is the purpose of the Law of Likelihood?

More importantly, I am enough of a pragmatist to think that what ultimately matters is not how well a given measure of evidential favoring captures our intuitions, but how well it serves our purposes. A strong counterexample in which the Law of Likelihood said something obviously crazy would provide at least a moderately strong reason to think that it would not serve our purposes very well, at least in some cases. A weaker counterexample in which it merely said something somewhat counterintuitive would provide a correspondingly weak reason to doubt its usefulness. For the reasons given in my previous post, I think that Mike’s example is in this sense a weak counterexample at best.

For likelihoodists such as Edwards (1972), Royall (1997), and Sober (2008), the purpose of the Law of Likelihood is to provide the basis for an alternative to Bayesian and frequentist methodologies for science. That methodology provides objective characterizations of data as evidence, unlike either frequentist or Bayesian methodologies. It is supposed to be a genuine *alternative* to Bayesian and frequentist methodologies, meaning that it involves regarding likelihood ratios as of interest in their own right and not merely as useful frequentist test statistics or as inputs for Bayesian updating.

A positive reason to hang on to the Law of Likelihood despite Mike’s example is that **it is the only real candidate for an account of evidential favoring that can provide the basis for an alternative to Bayesian and frequentist methodologies.** Rival accounts such as the claim that $E$ favors $H_1$ over $H_2$ if and only if $\Pr(E|H_1)/\Pr(E|\sim H_1)>\Pr(E|H_2)/\Pr(E|\sim H_2)$ cannot serve this purpose, because they appeal to quantities such as $\Pr(E|\sim H_1)$ that are not objective and are difficult even to specify subjectively in typical cases in science. In the case of $\Pr(E|\sim H_1)$ in particular, one would typically assign it a value by specifying a set of alternative hypotheses $H_2, H_3, \ldots$ to $H_1$ such that $\Pr(E|H_i)$ is objectively given for each $i=2,3,\ldots$; putting a prior probability distribution over $H_1, H_2, \ldots$; and using the expression $\Pr(E|\sim H_1)=\sum_{i>1}\Pr(E|H_i)\Pr(H_i)$, treating the set $\{H_1, H_2, \ldots\}$ as if it were exhaustive.

But if one has a prior probability distribution over $H_1, H_2, \ldots$ that one is willing to use for this purpose, then there does not seem to be any reason to bother characterizing $E$ as evidence with respect to $H_1$ and $H_2$: one could instead give the posterior odds for $H_1$ and $H_2$, which are more immediately relevant for belief and action and are actually easier to specify in that they do not depend on either $\Pr(H_i)$ or $\Pr(E|H_i)$ for any $i>2$.

If one prefers to give a measure of evidential favoring rather than posterior odds so that the members of one’s audience can use that measure of favoring to update their own personal prior probability distributions, then again the Law of Likelihood will be more suitable than its rivals for that purpose because (1) it does not (typically) depend on one’s own prior probability distribution and (2) it has a nice Bayesian interpretation as the ratio of posterior to prior odds.
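The prior-dependence of the rival measure can be illustrated with the card example. The sketch below uses made-up priors, and the partition into $H_1$, $H_2$, and a catch-all $H_3$ (all non-heart, non-queen-of-spades cards, so $\Pr(E|H_3)=0$) is my own illustrative assumption:

```python
from fractions import Fraction

# Likelihoods (objectively given, as in the example):
pr_E = {"H1": Fraction(12, 13), "H2": Fraction(1), "H3": Fraction(0)}

def pr_E_given_not(h, prior):
    """Pr(E|~h) = sum over i != h of Pr(E|Hi) Pr(Hi|~h),
    treating {H1, H2, H3} as exhaustive."""
    rest = [k for k in pr_E if k != h]
    total = sum(prior[k] for k in rest)
    return sum(pr_E[k] * prior[k] for k in rest) / total

# Two (made-up) priors over the hypotheses:
uniform_cards = {"H1": Fraction(13, 52), "H2": Fraction(1, 52), "H3": Fraction(38, 52)}
equal_weights = {"H1": Fraction(1, 3), "H2": Fraction(1, 3), "H3": Fraction(1, 3)}

for prior in (uniform_cards, equal_weights):
    ratio_h1 = pr_E["H1"] / pr_E_given_not("H1", prior)
    ratio_h2 = pr_E["H2"] / pr_E_given_not("H2", prior)
    print(ratio_h1 > ratio_h2)
```

With the uniform-over-cards prior the rival measure favors $H_1$; with the equal-weights prior it favors $H_2$. The likelihood ratio $\Pr(E|H_1)/\Pr(E|H_2)=12/13$, by contrast, does not move.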

### Caveats

I have argued elsewhere that the Law of Likelihood cannot in fact provide the basis for a genuine alternative to Bayesian and frequentist methodologies. If I am right in that claim, then this reason for hanging on to the Law of Likelihood is ultimately moot. However, the idea of a methodology that yields objective characterizations of data as evidence is *prima facie* highly appealing, and I am anxious not to dismiss it prematurely, as I think one would be doing if one were to reject the Law of Likelihood on the basis of an intuitive reaction to Mike’s example.

Perhaps some alternative to the Law of Likelihood can serve some other useful purpose than to provide an alternative to Bayesian and frequentist methodologies for science. For instance, Branden Fitelson suggested in a comment on a previous post that some other measure (perhaps $[\Pr(E|H_1)\Pr(E|\sim H_2)]/[\Pr(E|\sim H_1)\Pr(E|H_2)]$?) could serve as a measure of “comparative argument strength” in inductive logic. If such a measure gives the more intuitive verdict in Mike’s example, then perhaps that fact gives it some boost in initial plausibility. What really matters, though, is whether or not it can serve some truly significant purpose.


Michael Lew says

Greg, I’m really uncomfortable with your treatment of the law of likelihood as something that applies to hypotheses that are collections of parameter values. I don’t think that Fisher, Edwards or Royall treat likelihood in a manner that allows your interpretation.

My concern is not simply that you are not following the “authorities”, but that you are ignoring the arbitrary scaling of likelihood: the likelihood of a hypothesis is **proportional** to the probability of the observation assuming the hypothesis. Proportional to, not equal to. As I have previously pointed out, where hypotheses are parameter values on a likelihood function, like the values of theta in the illustration that you put at the top of this post, then they necessarily share a proportionality constant. Where the hypotheses exist on separate likelihood functions then the constants cannot be assumed equal.

Your hypotheses do not seem to be similar in nature to the various values of theta in the likelihood function that you show. (Yes, I know that the graph is not meant to illustrate the particulars of the problem at hand, but it does nicely illustrate the problem.) What would the relevant likelihood function look like? It seems to me that you assume it is two points only: one which represents the probability of the observation given the hypothesis of one particular card being passed; the other represents the probability of the observation given a hypothesis which is that any of thirteen cards is passed. Are you confident that the unknown proportionality constants are identical?

Jonathan Livengood says

Could you explain what you mean by saying that a hypothesis is or is not on a likelihood function? I’m trying to understand your worry about Greg’s example, but I’m not seeing it at all. Could you give some simple concrete examples that illustrate the contrast between hypotheses existing on a single likelihood function and hypotheses existing on separate likelihood functions? And then show us what is supposed to be going wrong in those simple cases?

Michael Lew says

Jonathan, in Greg’s graph at the top of the post the hypotheses are the various values of theta. Those hypotheses are clearly simple parameter values and they are equally clearly points on a single likelihood function. The proportionality constant for that likelihood function is unknown, as is usually the case, but it is shared by all of the points on the likelihood function. Thus it is cancelled out when the likelihoods of the various values of theta (the hypotheses) are expressed as ratios. (Remember that the law of likelihood refers to the ratio of likelihoods as the measure of evidential favouring.)

I think that the hypotheses concerning Mike’s levels of pissed-offness are functionally and logically different from the values of theta. In particular, they differ in that their relationship with the observation is more complicated and it differs for ‘mildly’ and ‘really’ pissed off, as I’ll try to explain.

Greg doesn’t say what theta represents in his graph, so let’s assume for simplicity that it represents hypothesised values for mu, the mean of the population of interest. The observation is then likely to be the sample mean. The sample mean is a simple predictor of the population mean. For Mike’s alleged counter-example to the law of likelihood the observation is ‘hearts’ whereas the parameter of interest takes (at least) the values ‘mildly pissed off’ and ‘really pissed off’. The observation does not predict the parameters in a simple manner, but does so through two dimensions of the sample: the suit and the face value of the card passed.

It is probably possible to view the relevant likelihood function as having only two dimensions by assuming a single x-axis which is a simple listing of the 52 possible cards, but that doesn’t overcome the objection that I will detail in my next comment.

Michael Lew says

The hypotheses H1 (really pissed-off, which I’ll refer to as Hrpo from now) and H2 (mildly annoyed, Hma) in the example are treated as if they were simple hypotheses which are points on a likelihood function. Otherwise the law of likelihood would not have anything to say about them. However, my contention is that Hrpo is a composite hypothesis that cannot be a point on the same likelihood function as Hma and so the law of likelihood is silent about the evidence favouring them.

As the observation relates to the cards rather than the state of mind of Mike we must map the hypotheses onto the likelihood function that relates to the card values. Thus Hma is equivalent to the hypothesis that the card passed is the two of hearts, a hypothesis that I’ll call H2h. Hrpo is different. It is equivalent to the hypothesis that the card passed is the queen of spades or the three of hearts or the four of hearts … or the ace of hearts. It is a composite hypothesis.

If we assume H2h is true then the probability of observing the passed card to be a heart is one. If we assume Hrpo to be true then the probability that the passed card is a heart is unknown because it depends on the probability of its being the queen of spades, the three of hearts, and so on. Those probabilities are assumed to be equal in the treatment of the problem that yields a likelihood for Hrpo of 12/13, but in reality they are far from equal, as a good player is unlikely to pass the queen of spades unless they are short of other spades, and high-ranking hearts are far more likely to be passed than low-ranking hearts. Even ignoring the realities of the card game, the probabilities are unknown. Thus I do not see that we can have a likelihood for Hrpo.
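Michael’s point here can be made concrete with a quick sketch: the value 12/13 for $\Pr(\text{heart}|H_{rpo})$ depends on weighting the thirteen Hrpo-consistent cards equally, and it shifts as soon as the weights do. The particular alternative weighting below is purely hypothetical:

```python
from fractions import Fraction

def pr_heart_given_hrpo(weights):
    """Pr(passed card is a heart | Hrpo), where `weights` gives the relative
    chance of each Hrpo-consistent card being the one passed.
    Keys ending in 'h' are hearts; 'Qs' is the queen of spades."""
    hearts = sum(w for card, w in weights.items() if card.endswith("h"))
    return Fraction(hearts, sum(weights.values()))

heart_ranks = ["3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
uniform = {**{r + "h": 1 for r in heart_ranks}, "Qs": 1}
# Hypothetical: the queen of spades is passed 5x as often as any single heart.
qs_heavy = {**{r + "h": 1 for r in heart_ranks}, "Qs": 5}

print(pr_heart_given_hrpo(uniform))   # 12/13
print(pr_heart_given_hrpo(qs_heavy))  # 12/17
```

Under the uneven weighting, $\Pr(\text{heart}|H_{rpo})$ falls from 12/13 to 12/17, which is the dependence Michael is pointing at.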

Jonathan Livengood says

What you say about the realities of the card game is eminently sensible.

I still don’t get the worry about simple versus composite hypotheses, though. Take the beta distribution in Greg’s graphic. Suppose we had a prior on the bias of a coin and then saw 50 heads and 25 tails. We updated using the beta as a conjugate to get the posterior in Greg’s picture. Why can’t I use the law of likelihood to compare hypotheses of the form, “The bias is at least 75% towards Heads,” or “The bias is between 15% towards Heads and 35% towards Heads”?

If such comparisons are sensible, then what is the problem — setting aside the worries about how the game of Hearts is actually played — with using the law of likelihood to compare the hypotheses that Greg considers?
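One way to make Jonathan’s comparison concrete is to take $\Pr(E|H)$ for an interval hypothesis to be the prior-weighted average of the likelihood over the interval, assuming a uniform prior over the bias (an assumption a likelihoodist might resist). This is a sketch of that calculation for 50 heads and 25 tails, not a settled method:

```python
from math import comb

def pmf(x, n, theta):
    """Binomial probability of x heads in n tosses with bias theta."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def avg_likelihood(lo, hi, x=50, n=75, steps=50_000):
    """Pr(E | lo <= theta <= hi) under a uniform prior on theta:
    the prior-weighted average of the likelihood over the interval."""
    thetas = [lo + (hi - lo) * (i + 0.5) / steps for i in range(steps)]
    return sum(pmf(x, n, t) for t in thetas) / steps

high_bias = avg_likelihood(0.75, 1.0)   # "bias at least 75% towards Heads"
low_bias = avg_likelihood(0.15, 0.35)   # "bias between 15% and 35%"
print(high_bias / low_bias)
```

With the maximum-likelihood bias at $50/75\approx 0.67$, the first interval captures far more of the likelihood than the second, so the ratio comes out very large in favor of the high-bias hypothesis.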

Michael Lew says

Jonathan, if your hypotheses were about values on the scale of proportions then the relevant likelihood function would also be on the scale of proportions. There would be no problem with that. You would be able to evaluate the evidential favouring of any points on that likelihood function, but it seems that you would not be able to use the law of likelihood to quantitate the evidential favouring because of the interval nature of the hypotheses.

Greg Gandenberger says

Good point–the story behind the example doesn’t support a uniform probability distribution over the card types. We could easily modify the story to get rid of that problem, though. For instance, we could forget about Hearts and just say that a card is drawn at random from a standard deck, and ask about the hypothesis that the card is the two of hearts vs. the hypothesis that it is a three or higher of hearts or the queen of spades. Would you still have concerns about the application of the Law of Likelihood?

Michael Lew says

When you remove the pissed-offness layer of hypothesis the likelihood function becomes easy and the law of likelihood tells you which cards are favoured by the evidence. It still cannot quantify the favouring of a particular card relative to a range of cards, though, because you still cannot provide the probabilities for the cards within the range.

Michael Lew says

Sorry, my comment now seems a little off the mark. The problem with likelihoods for composite hypotheses seems to be related to the fact that likelihoods are not ‘proper’ probabilities in that they do not comply with Kolmogorov’s axioms of probability: they do not have to sum or integrate to unity. The fact that they do not sum to unity means that the scaling of likelihoods is always arbitrary, with an unknown proportionality constant. In your graph at the top you have scaled the likelihood function to integrate to unity (well, it looks that way), but you could equally well have scaled it to have a maximum of unity, as Edwards and Royall both routinely choose to do. I sometimes like to scale a likelihood function so that the likelihood at the null hypothesis is unity, so that the height of the function represents the strength of evidence against the null. Any way you choose to do it, the scale of likelihood is arbitrary.

The probability that a single randomly drawn heart card is in the set consisting of the queen of spades and all hearts other than the two of hearts is obviously higher than the probability that it is the two of hearts. But you calculate the likelihood as 12/13 in the first case and 1/1 in the second. That is the nub of the alleged counter-example. I offer two solutions.

First, the relevant likelihoods are not _equal_ to those probabilities but _proportional_ to them, and the proportionality constants differ and are unknown. That will be the case if the hypotheses are on different likelihood functions (and I think that they are).

Second, the law of likelihood simply does not apply to comparisons of simple hypotheses with composite hypotheses, or to composite hypotheses in general.

Michael Lew says

Greg, I’ve found this in Royall’s book (pages 17-18):

“The law of likelihood explains how an observation on a random variable should be interpreted as evidence in relation to two simple statistical hypotheses. It also applies to some composite hypotheses, such as $H_C$ in section 1.7. But it does not apply to composite hypotheses generally.”

Greg Gandenberger says

Thanks for the comments, Michael. Maybe someday we’ll get this disagreement worked out!

The reason I don’t accept your claim that the likelihoods for $H_1$ and $H_2$ in the example can’t be compared is that I don’t see why they can’t be arguments of a single likelihood function. In general, I don’t see why any pair of mutually exclusive hypotheses couldn’t be arguments of a single likelihood function.

I don’t see how appealing to the distinction between simple and composite hypotheses helps. Every hypothesis is equivalent to a disjunction of more specific hypotheses, so every hypothesis is in some sense a composite hypothesis. Hypotheses are only simple or composite relative to models, and it’s possible to provide a model relative to which $H_1$ and $H_2$ are both simple.

What is the principled, non-model-relative distinction between simple and composite hypotheses, or between pairs of hypotheses that do and do not lie on common likelihood functions? I don’t see it, but that could be my fault.

Michael Lew says

Greg, the claim that you cannot apply the law of likelihood to a comparison of a simple hypothesis with a composite hypothesis is not my claim, but a claim of Richard Royall (and possibly others as well, but I leave the task of finding out to you).

Saying that it is possible to make a model under which H1 and H2 are both simple is not the same as having done so. What is a model under which your H1 and H2 are both simple? What likelihood function does that model yield?

Likelihoods are always calculated within a statistical model and the law of likelihood can only apply to likelihoods generated according to the same model. That is why I keep insisting that they have to be part of a single likelihood function.

I think that your likelihoods for H1 and H2 come from two distinct models. If you have different statistical models for calculating the likelihood of H1 and H2 then the ratio of those likelihoods is not meaningful.

Greg Gandenberger says

When I say “model,” I mean a set of mutually exclusive and exhaustive hypotheses, a set of possible observations, and a conditional probability distribution over those possible observations for each of those hypotheses. You can easily produce a model in this sense that has $H_1$ and $H_2$ as hypotheses, “heart” as a possible observation, and $\Pr(heart|H_1)=12/13$ and $\Pr(heart|H_2)=1$ as points in the relevant conditional probability distributions.
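A minimal encoding of a model in Greg’s sense might look like the following sketch, which treats $H_1$ and $H_2$ as exhaustive, exactly as Greg’s proposal does (Michael’s objection in the next comment is precisely that this treatment is not exhaustive over the cards):

```python
from fractions import Fraction

# A model in Greg's sense: mutually exclusive, exhaustive hypotheses,
# possible observations, and one conditional distribution per hypothesis.
model = {
    "H1": {"heart": Fraction(12, 13), "not-heart": Fraction(1, 13)},  # really pissed off
    "H2": {"heart": Fraction(1), "not-heart": Fraction(0)},           # mildly annoyed
}

# Each conditional distribution sums to one:
assert all(sum(dist.values()) == 1 for dist in model.values())

# The law of likelihood then compares the two rows at the observed "heart":
print(model["H2"]["heart"] / model["H1"]["heart"])  # 13/12, favoring H2
```

Relative to this model, both hypotheses are simple in the formal sense that each is a single element of the hypothesis set with its own conditional distribution.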

Is such a model illegitimate? If so, why?

Michael Lew says

That model is legitimate but it is not a model of the cards (it is not exhaustive) and it simply obscures the fact that one of the hypotheses is composite and the other simple.

The nub of our dispute is how to deal with a comparison of likelihoods where one hypothesis is simple and the other is composite, so I can’t see how it helps to simply make a model where the composite hypothesis is dressed up as simple.

Greg Gandenberger says

Thanks for continuing the conversation, Michael. You write that the model I indicated “simply obscures the fact that one of the hypotheses is composite and the other simple.” But it seems to me that hypotheses are only ever simple relative to models. You could say that $H_2$ is composite because it is the disjunction of the hypothesis that the card is the two of hearts and the moon is made of green cheese and the hypothesis that the card is the two of hearts and the moon is not made of green cheese. It’s not *really* simple any more than $H_1$ is. It’s just simple relative to a more natural and obvious model. Or is it simple in some non-model-relative sense that I’m missing?

The problem with composite hypotheses that concerned Royall is that we sometimes don’t know what value to ascribe to $\Pr(E|H)$ when $H$ is composite. But in this case (with a little work) we don’t have that problem.

Michael Lew says

Greg, having slept on our problem I now feel that I have a clearer picture of why we disagree.

Let’s start with a likelihood function for the cards problem that we can agree about, the function that shows the likelihoods for each simple, single-card hypothesis. We could list the cards in any order along a single x-axis but it is convenient to clump the cards that lead to really pissed off together. After the observation of ‘hearts’ the likelihood function takes the value one for all heart cards and zero for all others. Presumably we agree.

Now we need to obtain a likelihood for the composite hypothesis which includes the queen of spades and the twelve non-two hearts cards. You give it a likelihood of 12/13 on the basis that that is the probability of the card being a heart if the hypothesis is true. However, I disagree. The likelihood is proportional to 12/13, but it is not equal to it.

I will justify my disagreement using the likelihood function at the top of your post. Say I have two hypotheses: H1 says that theta = 0.4; and H2 says that theta > 0.4. (H1 is simple and H2 is composite.) Inspection of the function will no doubt convince you that the likelihood of H1 is approximately 2. That likelihood is clearly not generated by just answering the question “what is the probability of the observation assuming that theta is 0.4?” because it is greater than one and so it is not a probability. That’s OK, because the function has been scaled to yield an integral of unity, and we could scale it using any proportionality constant we wished. We can proceed without re-scaling it.

What is the likelihood of H2? If we want to match the 12/13 value that you provide for Hrpo in the card example then we need to calculate the average value of the likelihood function for values of theta in excess of 0.4. That’s easy: it’s virtually zero, because the theta scale goes to infinity and so the likelihood over most of the scale is infinitesimally small. Let’s call the likelihood ‘zeroish’ for clarity.

Now we have un-scaled likelihoods for our two hypotheses, H1 and H2, of 2 and zeroish. What would the law of likelihood say about the evidence if it were applied to those values? It would say that the evidence supports the hypothesis of theta = 0.4 almost infinitely more strongly than it supports the hypothesis that theta is greater than 0.4. That is clearly nonsense, as inspection of the likelihood function at the top of your post shows that the most strongly supported value of theta is about 0.5.

Not only do we have a nonsense likelihood ratio for the observation that yielded the likelihood function that you display, but we have a calculation whereby there is no possible observation that would yield support for the composite hypothesis, H2, that is as large as the support for the simple H1. The problem comes from the averaging of likelihoods of the component hypotheses that make up the composite hypothesis. It seems that taking the average is not a valid maneuver. It is not valid for the continuous likelihood function shown at the top of your post and it is not valid for the likelihood function for the card problem.

Let me say here that I do not actually know what would be the correct values for the hypotheses H1 and H2. I suspect that it is not possible to get a comparable pair of likelihoods by any method because simple and composite hypotheses are not comparable using likelihoods and the law of likelihood.

That is not to say that composite hypotheses cannot be compared with other composite hypotheses, though. Consider a third hypothesis, H3, that says theta is less than 0.4. Its likelihood value is, like H2, zeroish. But it is a slightly smaller zeroish than that of H2 because it contains less of the likelihood function density. We can obtain the ratio of those two zeroish values as the ratio of the integrals of the function that they contain. The evidence supports H2 more strongly than it supports H3.

It seems to me that the informal lesson that comes from my example is that for the likelihoods to be comparable their hypotheses need to have equivalent ‘span’ along the x-axis of the likelihood function. Thus I conclude that the counter-intuitive results that come from Mike Titelbaum’s alleged counter-example come from a misapplication of the law of likelihood to a pair of likelihood values that are not scaled to be directly comparable.

Jonathan Livengood says

Michael,

What you’re saying isn’t engaging with what Greg has been writing. He’s not writing about the values of the likelihood function; he’s explicitly talking about a rule that involves probabilities. Not values proportional to probabilities. Probabilities. Greg explicitly defines the law of likelihood as follows:

The Law of Likelihood says that evidence E favors hypothesis H1 over hypothesis H2 if and only if Pr(E|H1) > Pr(E|H2).

Those are conditional probabilities, not values of a likelihood function.

For this reason, your remarks on the beta distribution example given graphically at the top of the post are seriously off the mark. The composite hypothesis you consider — that theta is greater than 0.4 — is very strongly favored over the hypothesis that theta is exactly 0.4 according to the law of likelihood as Greg has stated it. In fact, trivially so. The probability that theta is exactly 0.4 given 25 observed successes and 25 observed failures is just zero. By contrast, the probability that theta is greater than 0.4 given the evidence is approximately 0.922.
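Jonathan’s figure can be checked numerically, assuming a uniform prior on theta so that the posterior after 25 heads and 25 tails is Beta(26, 26). The integration scheme below is just a rough midpoint rule over the unit interval:

```python
def beta_posterior_tail(a, b, cut, steps=100_000):
    """P(theta > cut) for a Beta(a, b) posterior, by midpoint integration
    of the unnormalized kernel theta^(a-1) (1-theta)^(b-1)."""
    def kernel(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)
    ts = [(i + 0.5) / steps for i in range(steps)]
    total = sum(kernel(t) for t in ts)
    tail = sum(kernel(t) for t in ts if t > cut)
    return tail / total

# Uniform prior + 25 heads, 25 tails -> Beta(26, 26) posterior.
p = beta_posterior_tail(26, 26, 0.4)
print(round(p, 3))
```

This gives a value in the same ballpark as Jonathan’s 0.922, while the posterior probability that theta is exactly 0.4 is of course zero.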

Jonathan Livengood says

Greg,

Incidentally, isn’t it problematic for the law of likelihood as you have stated it that if we flip a coin and observe 25 Heads and 25 Tails, the evidence does not favor the hypothesis that the bias of the coin is exactly 0.5 for Heads over the hypothesis that the bias of the coin is exactly 0 for Heads? Is there a standard way of handling such problem cases?

Michael Lew says

Jonathan, I think that Greg is a very clever man and I respect his opinion, but it is not within his purview to redefine the law of likelihood as a law of conditional probabilities!

Seriously, the law of likelihood specifies likelihoods and ratios of likelihoods.

It is commonplace to ignore the fact that a likelihood is only defined up to a proportionality constant and thus to think that it is equal to the probability. That would be convenient, but it is not the true state of the world.

At the risk of excessive repetition, I will point out the fact that if the two likelihoods in question are simple points on a single likelihood function then their ratio will cancel out the proportionality constants and will be the same number as the ratio of probabilities in your (mis-) definition of the law of likelihood.

Greg Gandenberger says

Thanks for the comments, Michael and Jonathan.

From Michael:

The average needs to be weighted by the prior probability according to the rule $f(E|\theta>0.4)=\int_{0.4}^\infty g(E|\theta)\,h(\theta|\theta>0.4)\,d\theta$, where $f$, $g$, and $h$ are the relevant density functions. The result won’t be zeroish; it will be greater than 2.

Again, it seems to me that any pair of mutually exclusive hypotheses can belong to a common likelihood function.

From Jonathan:

That’s right. As an aside, Branden Fitelson has pointed out to me that there’s some work to be done in clarifying whether likelihood functions should be understood as conditional probabilities ($\Pr(E|H)$) or as probabilities entailed by hypotheses (sometimes written $\Pr(E;H)$ or $\Pr_H(E)$). If they are understood as conditional probabilities, then likelihoodists have old evidence problems. If not, then the Likelihood Principle is not easily understood as a Bayesian principle. (This paper is helpful for getting a handle on the relevant issues.) But I think we can put that issue aside for now by restricting our focus to cases in which $\Pr(E|H)$ and $\Pr(E;H)$ are numerically the same even if they are conceptually different things.

I think there’s been a miscommunication at some point. If the number of tosses is fixed at 50 and X is the number of heads, then $\Pr(X;p)\propto p^X(1-p)^{50-X}$, which is maximized at $p=X/50$. Thus, the evidence would favor the hypothesis that the bias of the coin is exactly 0.5 for Heads over the hypothesis that the bias of the coin is exactly 0 for Heads. In fact, it would do so to a maximal degree because $\Pr(X=25;p=0)=0$.
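This is easy to verify numerically; the grid scan below is only a sketch of the point:

```python
from math import comb

def pr_x(x, n, p):
    """Pr(X = x; p) for n fixed tosses with heads-probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Scan p over a grid: the pmf at X = 25, n = 50 peaks at p = X/50 = 0.5,
# and vanishes entirely at p = 0.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: pr_x(25, 50, p))
print(best_p)            # 0.5
print(pr_x(25, 50, 0.0)) # 0.0
```

So on this reading the hypothesis $p=0.5$ is favored over $p=0$ to a maximal degree, as Greg says.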

Jonathan Livengood says

Oh. Yeah. Stupid of me. Ignore that comment. 😉