### Welcome, Daily Nous readers!

# Introduction

I have been recommending the first chapter of Elliott Sober’s Evidence and Evolution to those who ask for a good introduction to debates about statistical inference. That chapter is excellent, but it would be nice to be able to recommend something shorter that is readily available online. Here is my attempt to provide a suitable source. I presuppose some familiarity with probabilities but no formal training in probability theory.

Statistical inference is an attempt to evaluate a set of probabilistic hypotheses about the behavior of some data-generating mechanism. It is perhaps the most tractable and well-studied kind of inductive inference. The three leading approaches to statistical inference are Bayesian, likelihoodist, and frequentist. All three use likelihood functions, where the likelihood function for a datum $E$ on a set of hypotheses $\textbf{H}$ is $\Pr(E|H)$ (the probability of $E$ given $H$) considered as a function of $H$ as it varies over the set $\textbf{H}$. However, they use likelihood functions in different ways and for different immediate purposes. Likelihoodists and Bayesians use them in ways that conform to the *Likelihood Principle* (see Gandenberger 2014), according to which the evidential meaning of $E$ with respect to $\textbf{H}$ depends only on the likelihood function of $E$ on $\textbf{H}$, while frequentists use them in ways that violate the Likelihood Principle. Likelihoodists use likelihood functions to characterize data as evidence. Bayesians use them to update probability distributions. Frequentists use them to design experiments that are in some sense guaranteed to perform well in repeated applications in the long run.

I start with a real example that illustrates why these issues matter. I then discuss a fictitious simplified variant on that example to illustrate how the Bayesian, likelihoodist, and frequentist approaches work in typical cases. In my next post, I will discuss a stranger example that better illustrates how those approaches can come apart.

# A Real Motivating Example

In the 1980s, infants showing a particular pattern of respiratory problems had about a 20% survival rate. Then a team of researchers led by Robert Bartlett developed a new therapy called ECMO (extracorporeal membrane oxygenation): seventy-two of the first hundred patients on whom they tried it survived, even though the first fifty of those patients had already failed to respond to conventional therapy.

Despite their early successes, conventional standards of scientific evidence required Bartlett et al. to perform a randomized clinical trial in which ECMO and conventional treatments were used side-by-side in the same clinical setting and patient population. Concerned about the ethics of continuing to use the seemingly inferior conventional treatment, Bartlett et al. used an innovative “randomized play-the-winner” trial design that adjusted the probability that a given patient would receive a given treatment as the trial went along so that the treatment that had performed the best in the trial so far would be favored. The result was that all eleven infants given ECMO survived, and the one given conventional therapy died.

This result too looked rather compelling given available background knowledge, but because only one patient received conventional therapy, it did not meet the conventional standard for establishing the efficacy of a new treatment. As a result, a second randomized trial was conducted, this one led by Ware. He too was concerned about the ethics of continuing to use the seemingly inferior conventional treatment, so he designed his trial to have two phases: it would be randomized until four patients died on either treatment, and then it would continue using exclusively the other treatment. The result was that 28 of the 29 patients receiving ECMO survived, while 6 of the 10 receiving conventional therapy died.

That result also looks convincing, but it too failed to meet the conventional standard for establishing the efficacy of a new treatment. As a result, a group of researchers in the UK carried out a third randomized trial. Not surprisingly, that trial had to be terminated when early results clearly indicated ECMO’s superiority, but not until fifty-four more infants had died under conventional therapy.

As the parent of a child who was hospitalized with severe respiratory problems in the first month of life, this story makes my blood boil. **It illustrates the enormous costs of the failure of philosophers, statisticians, and scientists to reach consensus on a reasonable, workable approach to statistical inference in science.**

The standard of evidence that led to this debacle was a frequentist one. However, the example does not provide a knockdown argument against frequentist approaches generally, but only against the rigid and simplistic way in which frequentist ideas were applied in this particular case. Frequentist methods of meta-analysis, for instance, could have been used to pool the results of the first two trials and to make a case against the need for a third trial. That being said, one great advantage that likelihoodist and Bayesian methods have over frequentist methods is that they make it much easier to combine data from disparate sources.

# A Simple Illustrative Example

I will now present a fictitious variant on the example above to better illustrate how the likelihoodist, Bayesian, and frequentist approaches to statistical inference work. Suppose that the prevailing survival rate on conventional therapy was 50% and that nine of the first twelve patients treated with ECMO had survived. **What would likelihoodists, Bayesians, and frequentists say about the proposition that the probability of survival on ECMO is greater than the prevailing rate?**

## A Likelihoodist Treatment of the Simple Illustrative Example

Likelihoodists use likelihood functions to characterize data as evidence. Their primary interpretive tool is the Law of Likelihood, which says that $E$ favors $H_1$ over $H_2$ if and only if $\mathcal{L}=\Pr(E|H_1)/\Pr(E|H_2)$ is greater than one, with $\mathcal{L}$ measuring the degree of favoring.

The Law of Likelihood does not apply in a straightforward way to the hypothesis that the chance of survival on ECMO is greater than 50%. That hypothesis is a *composite* statistical hypothesis; that is, it is a *disjunction* of many hypotheses that do not all assign the same probability to the observed experimental result. The probability that nine out of twelve patients survive given that the probability of a given patient surviving is $p$ is well-defined for each particular value of $p$, but not for the claim that $p$ lies somewhere in a range of values.

We can use the Law of Likelihood to characterize the degree to which $E$ favors the hypothesis that the probability of survival is some particular number $p>$50% over the hypothesis that it is 50%. For instance, let “$H_p$” refer to the hypothesis that the probability that a given patient survives is $p$. Then according to the Law of Likelihood, the datum $E$ that nine out of twelve patients treated with the ECMO survived favors the hypothesis $H_{75\%}$ that the probability of survival is 75% over the hypothesis $H_{50\%}$ that it is 50% to the degree $\Pr(E|H_{75\%})/\Pr(E|H_{50\%})=4.8$.

Royall (2000, 761) suggests treating a likelihood ratio of 8 as the cutoff for declaring a piece of data “fairly strong evidence” favoring one hypothesis over another, and a likelihood ratio of 32 as the cutoff for “strong evidence.” By this standard, **$E$ favors $H_{75\%}$ over $H_{50\%}$, but not to a “fairly strong” or “strong” degree.**

One could also ask about the degree to which the evidence favors $H_p$ over $H_q$ for any pair of survival rates $p$ and $q$. For instance, the Law of Likelihood says that $E$ favors $H_{75\%}$ over $H_{20\%}$ to a very high degree indeed (approximately 4475).
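These likelihood ratios are easy to check. Here is a minimal Python sketch of the calculation (binomial probabilities for nine survivors out of twelve; note that the binomial coefficient cancels in each ratio):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Likelihood of the datum E (nine of twelve survive) under each simple hypothesis H_p
likelihood = {p: binom_pmf(9, 12, p) for p in (0.20, 0.50, 0.75)}

print(round(likelihood[0.75] / likelihood[0.50], 1))  # 4.8
print(round(likelihood[0.75] / likelihood[0.20]))     # 4475
```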

The plot below shows the degree to which $E$ favors $H_p$ over $H_{50\%}$ as a function of $p$, according to the Law of Likelihood.

From a likelihoodist perspective, there is no need to decide ahead of time which questions to ask, and it is completely legitimate to ask all of them simultaneously. This feature of the likelihoodist approach distinguishes it from the frequentist approach, as we will see below and in more detail next week.

## A Bayesian Treatment of the Simple Illustrative Example

Bayesians use likelihood functions to update probabilities rather than treating them as objects of interest in their own right. They contend that a rational agent has degrees of belief that conform to the axioms of probability, which he or she updates by conditioning. That is, if one learns the proposition $E$ with certainty and nothing else, then one should replace one’s prior degree of belief $\Pr(H)$ in any proposition $H$ with one’s prior degree of belief $\Pr(H|E)$ in $H$ conditional on $E$, which is given by Bayes’s theorem:

$$ \Pr(H|E)=\frac{\Pr(E|H)\Pr(H)}{\Pr(E|H)\Pr(H)+\Pr(E|\neg H)\Pr(\neg H)}$$

This updating rule has a nice connection with the Law of Likelihood: the posterior odds for a pair of hypotheses on this update rule is their prior odds times their likelihood ratio. That is,

$$\frac{\Pr(H_1|E)}{\Pr(H_2|E)}=\frac{\Pr(H_1)}{\Pr(H_2)}\frac{\Pr(E|H_1)}{\Pr(E|H_2)}$$

Now, a hypothesis like $H_{75\%}$ that posits that a continuous parameter (in this case, the chance of survival for an infant treated with ECMO) has a particular, sharp value will typically have prior probability zero. When considering such hypotheses, we need to use probability *densities*, which are the continuous analogues of discrete probability distributions. The probability that a continuous quantity is in any finite interval is given by the area under the probability density curve within that interval (or, equivalently, its integral over that interval). For instance, the figure below shows a reasonable prior probability density over the possible values of the parameter giving the chance of survival for someone who receives ECMO. The area of the blue region is the prior probability that the chance of survival is between 45% and 55%.

The equations above still hold when probabilities are replaced with probability densities. We can now see how the Bayesian approach would handle the example considered above. Using $p(H)$ rather than $\Pr(H)$ for probability densities, the continuous analogue of the odds equation tells us

$$\frac{p(H_{75\%}|E)}{p(H_{50\%}|E)}=\frac{p(H_{75\%})}{p(H_{50\%})}\frac{\Pr(E|H_{75\%})}{\Pr(E|H_{50\%})}=\frac{p(H_{75\%})}{p(H_{50\%})}\times 4.8$$

and

$$\frac{p(H_{75\%}|E)}{p(H_{20\%}|E)}=\frac{p(H_{75\%})}{p(H_{20\%})}\frac{\Pr(E|H_{75\%})}{\Pr(E|H_{20\%})}=\frac{p(H_{75\%})}{p(H_{20\%})}\times 4475$$

Suppose for the sake of illustration that one’s prior degrees of belief are appropriately represented by the figure above. Then one has

$$p(H_{20\%})=.77$$

$$p(H_{50\%})=1.4$$

$$p(H_{75\%})=1.3$$

and thus

$$\frac{p(H_{75\%})}{p(H_{50\%})}=.928$$

$$\frac{p(H_{75\%}|E)}{p(H_{50\%}|E)}=.928\times 4.8\approx 4.5$$

and

$$\frac{p(H_{75\%})}{p(H_{20\%})}=1.7$$

$$\frac{p(H_{75\%}|E)}{p(H_{20\%}|E)}=1.7\times 4475\approx 7555$$
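These odds computations are easy to reproduce. A minimal sketch (the prior density values 0.77, 1.4, and 1.3 are read off the figure, so the final figures agree with the text only up to rounding):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Prior density values read off the prior density figure in the post
prior_density = {0.20: 0.77, 0.50: 1.4, 0.75: 1.3}

def posterior_odds(p1, p2, k=9, n=12):
    """Odds form of Bayes's theorem: posterior odds = prior odds times likelihood ratio."""
    likelihood_ratio = binom_pmf(k, n, p1) / binom_pmf(k, n, p2)
    return (prior_density[p1] / prior_density[p2]) * likelihood_ratio

print(round(posterior_odds(0.75, 0.50), 1))  # 4.5
print(round(posterior_odds(0.75, 0.20)))     # about 7556; 7555 in the text, from intermediate rounding
```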

The results for the entire posterior probability distribution are given by the orange curve in the figure below.

On a Bayesian approach, one can also assess the composite hypotheses that the chance of survival is greater than 50% and less than 50%, respectively. Again using the same probability distribution, one gets

$$\Pr(p<50\%)=.42$$

$$\Pr(p\geq 50\%)=.58$$

and

$$\Pr(p<50\%|E)=.05$$

$$\Pr(p\geq 50\%|E)=.95$$

**Thus, the data raise the probability that ECMO produces a survival rate higher than the prevailing rate on conventional therapy from $.58$ to $.95$, on the particular prior probability distribution used here.**
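The composite-hypothesis computation can be sketched with a grid approximation. The prior below is a flat density on $(0,1)$ chosen purely for illustration, not the prior in the figure (which is why its prior probability of $p \geq 50\%$ is 0.5 rather than 0.58), but the posterior probability happens to come out close to the same value:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Grid approximation: discretize p, weight the prior density at each grid point
# by the likelihood of the data, then renormalize.
grid = [i / 10000 for i in range(1, 10000)]
prior = [1.0] * len(grid)  # flat prior density, for illustration only
posterior = [w * binom_pmf(9, 12, p) for w, p in zip(prior, grid)]
total = sum(posterior)

prob_ge_half = sum(w for w, p in zip(posterior, grid) if p >= 0.5) / total
print(round(prob_ge_half, 2))  # 0.95
```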

One great advantage of the Bayesian approach is that it tells one exactly to what degree one should believe a given hypothesis on the basis of a given piece of evidence. The great disadvantage is that it does so only relative to a given degree of belief in the hypothesis prior to receiving the evidence.

## A Frequentist Treatment of the Simple Illustrative Example

Frequentists generally reject the Law of Likelihood and the use of Bayesian probabilities in science. Their theory was originally developed and justified exclusively in terms of long-run error rates on decisions about how to behave with regard to hypotheses, rather than in terms of the degree to which the data support judgments about the alethic or epistemic value of particular hypotheses. As Neyman and Pearson put it in their original presentation of the frequentist approach, “without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not too often be wrong” (1933, 291). In practice, however, the outputs of frequentist methods are typically interpreted in terms of evidence and belief. The error-statistical philosophy developed primarily by Deborah Mayo is an ambitious attempt to develop and defend such interpretations.

A typical frequentist approach to the example under discussion would be to designate the hypothesis that ECMO is no more effective than conventional therapy (50% survival or less) the “null hypothesis” $H_0$ and to test it against the “alternative hypothesis” $H_a$ that ECMO is better than conventional therapy (greater than 50% survival). A trial would be designed to control both the probability of rejecting $H_0$ if it is true (called “the Type I error rate”) and the probability of failing to reject it if it is false by some margin one would hate to miss (called “the Type II error rate”), such as a true survival rate of 60% in this case. The usual approach to controlling these error rates is to choose the Type I error rate that one is willing to accept (often 5%); choose a trial design with maximum power (i.e., minimum Type II error rate) at that Type I error rate; and choose the sample size (in this case, the number of patients to treat) that makes the Type II error rate acceptably low (often 20%).
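The sample-size step can be sketched numerically. Here is a minimal illustration, assuming a one-sided exact binomial test of $H_0\!: p = 0.5$ at the 5% level with a target of 80% power against $p = 0.6$ (numbers chosen to match the discussion above, not taken from any actual trial):

```python
from math import comb

def binom_tail(n, s, p):
    """P(at least s successes in n independent trials with success probability p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(s, n + 1))

def design(alpha=0.05, power=0.80, p0=0.5, p1=0.6):
    """Smallest sample size n (and rejection threshold s) for a one-sided
    exact binomial test of H0: p = p0 against the alternative p = p1."""
    for n in range(1, 1000):
        # smallest threshold keeping the Type I error rate at or below alpha
        # (s = n + 1 means "never reject" and trivially qualifies)
        s = next(s for s in range(n + 2) if binom_tail(n, s, p0) <= alpha)
        if binom_tail(n, s, p1) >= power:  # Type II error rate at most 1 - power
            return n, s
    raise ValueError("no design found")

n, s = design()
print(n, s)  # a sample size somewhere around 150; the exact value reflects the binomial sawtooth
```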

**On a frequentist approach, what one can conclude from the data depends greatly on the design of the experiment that produced the data.** Type I and Type II error rates are properties of repeatable procedures rather than of particular instances of those procedures. For this reason, frequentists generally proceed in accordance with protocols that they specify ahead of time. Otherwise, they would face often unanswerable questions about what repeatable procedure they were implementing.

An experimental protocol typically specifies both when the experimenters are to look at the data and what they are to conclude from various possible observations. If the trial protocol does not call for looking at the data after nine of the first twelve patients survived, then a frequentist cannot conclude anything from that datum. If it does call for looking at the data at that point, then what he or she can conclude may depend on when else the protocol would call for looking at the data.

For instance, suppose that the trial protocol calls for looking at the data once, after three patients have died. The most powerful test with Type I error rate no more than 5% rejects the null hypothesis in this case if and only if it takes twelve or more patients to reach three deaths. Thus, under this stopping rule, a frequentist could conclude from nine of the first twelve patients surviving that the new treatment is more effective than the old one.

On the other hand, suppose that the trial protocol calls for looking at the data once, after twelve patients have been treated. The most powerful test with Type I error rate no more than 5% rejects the null hypothesis in this case if and only if ten or more of those patients survive. Thus, under this stopping rule, a frequentist could *not* conclude from nine of the first twelve patients surviving that the new treatment is more effective than the old one.
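The rejection thresholds under the two stopping rules can be computed directly. A minimal sketch of both calculations (my own, not from the original trials):

```python
from math import comb

# Stopping rule 1: observe until three patients have died. Under H0 (survival
# probability 0.5), the third death occurs at patient n or later exactly when
# at most two of the first n - 1 patients die.
def p_third_death_at_or_after(n):
    return sum(comb(n - 1, d) * 0.5 ** (n - 1) for d in range(3))

# Smallest n making {third death at patient >= n} a level-0.05 rejection region
n_star = next(n for n in range(3, 100) if p_third_death_at_or_after(n) <= 0.05)
print(n_star)  # 12 -- so nine survivors among the first twelve patients rejects H0

# Stopping rule 2: observe exactly twelve patients; reject H0 when at least
# s_star of them survive.
def binom_tail(n, s, p=0.5):
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(s, n + 1))

s_star = next(s for s in range(13) if binom_tail(12, s) <= 0.05)
print(s_star)  # 10 -- so nine survivors out of twelve does NOT reject H0
```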

Frequentists are not permitted to decide what questions to ask after looking at the data, except in limited ways that must be carefully prescribed ahead of time. Long-run error rates can be controlled with respect to particular questions, but not with respect to any and all questions simultaneously.

Frequentist methods’ sensitivity to “stopping rules” (i.e., the rules that tell the experimenters when to stop collecting data and draw conclusions) and to whether or not questions were predesignated violates the Likelihood Principle: those factors have no effect on the likelihood function and thus, according to the Likelihood Principle, are irrelevant to the evidential meaning of the data.

# Summary

Likelihoodist methods characterize the data as evidence with respect to pairs of simple statistical hypotheses. Bayesian methods use the data to update a probability distribution over the hypothesis space. Likelihoodist and Bayesian methods conform to the Likelihood Principle and fit together nicely.

Frequentist methods are rather different. Their creators regarded them not as providing assessments of the epistemic statuses of individual hypotheses, but instead as merely controlling long-run error rates. They violate the Likelihood Principle for the sake of controlling long-run error rates.

In my next post, I plan to discuss an example that brings out the difference among these approaches more clearly.

To share your thoughts about this post, comment below or send me an email. Comments support $\LaTeX$ mathematical expressions: surround with single dollar signs for in-line math or double dollar signs for display math.

Michael Lew says

Greg, welcome back to your blog! This is a very useful and clear post. I look forward to the next part.

I do have an observation, though. You set up the post to be about likelihood, Frequentism and Bayesianism as rivals, but it seems to me that they are rivals only in so far as a bicycle, a truck and a train are rivals. There are tasks for which one is much better suited than the others and, while there is some overlap in their ranges of competency, each has advantages for some tasks. The problem in the philosophy of statistics should not be one of choosing which type of inference to prefer for all purposes, but of deciding which purposes are best served by each.

Greg Gandenberger says

Yes, good point. Your analogy is nice. One point of disanalogy is that it at least makes sense to say that frequentist methods are sometimes useful but fundamentally misguided, which isn’t something you can say about a bicycle.

Rok says

Nice post, although it wouldn’t hurt if you corrected the numerical errors in the Bayesian section – the probability density function integrates to 1, but e.g. your p(H_50%) is already 1.4, and the vertical scales on the graphs there also seem to be off in the same way…

Greg Gandenberger says

Thanks for the comment! A probability density actually can exceed one. For instance, the density of a uniform distribution over the $[0, 1/2]$ interval is 2 everywhere on that interval. Since this one is on the $[0,1]$ interval, it has to either exceed 1 somewhere or be exactly 1 everywhere (except possibly on a set of measure zero).

Rok says

Ah, silly me. You are of course correct. Thanks for pointing it out for me.

Greg Gandenberger says

No problem! I had to stop and think about it myself when I saw numbers greater than one.

Mark Holder says

Nice post.

I think you wanted to say:

“the probability of survival is some particular number $p>50$\%”

rather than:

“the probability of survival is some particular number $p>5$\%”

in the third paragraph of “A Likelihoodist Treatment of the Simple Illustrative Example”

Greg Gandenberger says

Quite right. Thanks, Mark!