# Introduction

My goal in this series of posts is to provide a short, self-contained introduction to likelihoodist, Bayesian, and frequentist methods that is readily available online and accessible to someone with no special training who wants to know what all the fuss is about.

In the first post, I give a motivating example that illustrates the enormous costs of the failure of philosophers, statisticians, and scientists to reach consensus on a reasonable, workable approach to statistical inference. I then use a fictitious variant on that example to illustrate how likelihoodist, Bayesian, and frequentist methods work in a simple case.

In the second post, I use a strange example to illustrate how likelihoodist, Bayesian, and frequentist methods can come apart.

The second post is not ideal for pedagogical purposes because the example it uses is somewhat difficult to understand without special training. **This post is intended to illustrate some (though not all) of the same issues in a more accessible way.**

# Example

Suppose you were to take a single observation from a normally distributed random variable $X$ with unknown mean and standard deviation, yielding $X=0$. **What should you say about the mean and standard deviation of the distribution?**

For those who are not familiar with these terms, the claim that $X$ is normally distributed means (roughly) that it follows a bell-shaped curve. The mean of the curve gives the location of its peak, and the standard deviation tells how spread out the distribution is around that peak. The animation below shows how the probability distribution of $X$ varies with the mean $\mu$ and standard deviation $\sigma$.
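For readers who like to see formulas as code, here is a minimal sketch of the normal density in Python (the function name is mine; only the standard library is used). It checks the two facts just stated: the density peaks at the mean, and a larger standard deviation spreads the curve out, lowering the peak.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Bell-curve density: peak at mu, spread controlled by sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# The density is highest at the mean...
assert normal_pdf(0, 0, 1) > normal_pdf(1, 0, 1)
# ...and a larger sigma flattens the curve, lowering the peak.
assert normal_pdf(0, 0, 2) < normal_pdf(0, 0, 1)
```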

# A Likelihoodist Perspective

The likelihoodist approach is based on the *Law of Likelihood*, which says that $X=0$ favors hypothesis $H_1$ over hypothesis $H_2$ if and only if their likelihood ratio $\mathcal{L}=p(X=0|H_1)/p(X=0|H_2)$[^1] is greater than 1, with $\mathcal{L}$ measuring the degree of favoring.

Let us fix the standard deviation at one, say, and consider what the Law of Likelihood says about hypotheses about the mean $\mu$. As one might expect, it says that $X=0$ favors $\mu=0$ over all other hypotheses of the form $\mu=\mu_0$ to a degree that increases with $|\mu_0|$. The degree to which $X=0$ favors $\mu=0$ over $\mu=\mu_0$ as a function of $\mu_0$ is shown below.
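The degree of favoring described above is easy to compute directly. The sketch below (function names are mine) evaluates the likelihood ratio for $\mu=0$ against $\mu=\mu_0$ with $\sigma$ fixed at 1 and $X=0$; in closed form it equals $e^{\mu_0^2/2}$, which grows with $|\mu_0|$ as stated.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and sd sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def favoring_mu(mu0, x=0.0, sigma=1.0):
    """Degree to which x favors mu=0 over mu=mu0 (sigma fixed)."""
    return normal_pdf(x, 0.0, sigma) / normal_pdf(x, mu0, sigma)

# The degree of favoring matches exp(mu0**2 / 2) and increases with |mu0|.
for mu0 in (0.5, 1.0, 2.0):
    assert abs(favoring_mu(mu0) - exp(mu0 ** 2 / 2)) < 1e-12
assert favoring_mu(2.0) > favoring_mu(1.0) > favoring_mu(0.5) > 1.0
```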

So far, so good. But now let’s fix the mean at zero, say, and consider what the Law of Likelihood says about hypotheses about the standard deviation $\sigma$. It says that $X=0$ favors $\sigma_1$ over $\sigma_2$ whenever the former is smaller than the latter, with the degree of favoring for a given value of $\sigma_2$ becoming unbounded as $\sigma_1$ goes to zero. The degree to which $X=0$ favors $\sigma=\sigma_0$ over $\sigma=1$ as a function of $\sigma_0$ is shown below (with $\mu$ assumed to be 0).
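The unboundedness is visible in a quick computation. With $\mu=0$ and $X=0$, the likelihood ratio for $\sigma=\sigma_0$ against $\sigma=1$ reduces to $1/\sigma_0$, which blows up as $\sigma_0\to 0$. A minimal check (function names are mine):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and sd sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def favoring_sigma(sigma0, x=0.0, mu=0.0):
    """Degree to which x favors sigma=sigma0 over sigma=1 (mu fixed)."""
    return normal_pdf(x, mu, sigma0) / normal_pdf(x, mu, 1.0)

# With x = mu = 0 the ratio is exactly 1/sigma0: unbounded as sigma0 -> 0.
for s in (0.5, 0.1, 0.01):
    assert abs(favoring_sigma(s) - 1.0 / s) < 1e-9
```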

**The fact that this plot rises without bound as $\sigma_0$ goes to zero strikes many as strange.** $\sigma$ quantifies the degree of *variation* we should expect in a sequence of observations. Intuitively, we cannot learn anything about variation from a *single* observation. Thus, one observation cannot possibly tell us anything about $\sigma$. The Law of Likelihood should say that $X=0$ does not favor any value of $\sigma$ over any other.

Notice, however, that the Law of Likelihood says that $X=0$ favors $\sigma=0$ over other values of $\sigma$ only *when the mean is fixed at 0*. In other words, it says that $X=0$ favors $(\mu=0,\sigma=0)$ over $(\mu=0,\sigma=\sigma_0)$ for all $\sigma_0\neq 0$. When the mean is fixed at some other value, the Law of Likelihood says that $X=0$ favors some other value of $\sigma$ over all others. The figure below gives the likelihood function over pairs of values for $\mu$ and $\sigma$. Its global maximum is at $(\mu=0,\sigma=0)$, but its maximum as a function of $\sigma$ varies with $\mu$.
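The dependence of the maximizing $\sigma$ on $\mu$ can be verified numerically. For $X=0$ and fixed $\mu$, calculus gives $\sigma=|\mu|$ as the likelihood-maximizing value; the grid search below (a rough sketch, function names mine) recovers this.

```python
from math import exp, pi, sqrt

def likelihood(mu, sigma, x=0.0):
    """Normal likelihood of the observation x at parameters (mu, sigma)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def argmax_sigma(mu, x=0.0):
    """Grid search for the sigma maximizing the likelihood at fixed mu."""
    grid = [0.001 * k for k in range(1, 5001)]   # sigma in (0, 5]
    return max(grid, key=lambda s: likelihood(mu, s, x))

# For x = 0, the maximizing sigma at fixed mu is |mu| (exact by calculus).
for mu in (0.5, 1.0, 2.0):
    assert abs(argmax_sigma(mu) - abs(mu)) < 1e-2
```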

One might still find it problematic that the Law of Likelihood says not only that $X=0$ favors $(\mu=0,\sigma=0)$ over all other $(\mu,\sigma)$ pairs, but that it does so *to an infinite degree*. This result arises from the fact that if $\mu=0$ and $\sigma=0$, then $\Pr(X=0)=1$; otherwise, $\Pr(X=0)=0$. The Law of Likelihood interprets this fact as indicating that $X=0$ favors $(\mu=0,\sigma=0)$ over all other $(\mu,\sigma)$ pairs to an infinite degree.

More generally, **the Law of Likelihood will always say that the data favors a hypothesis which entails that the data were bound to be what they in fact were over a hypothesis according to which the data were the result of chance.** This issue arises in an extreme way in examples like this one in which the chance hypotheses are continuous and the data are sharp, so that the probability of the *exact* datum in question given any of the chance hypotheses is zero.

One consideration that mitigates this problem in practice is that data are never sharp: all real measuring devices have finite precision. However, this fact does not address the problem as a matter of principle, nor does it address the more general point that the Law of Likelihood always favors a hypothesis on which the data were bound to be what they in fact were over one on which they were the result of chance.

Two considerations are more helpful in addressing worries arising from this fact about the Law of Likelihood. First, while the Law of Likelihood will always say that the data favors the *particular* hypothesis that the data were bound to be what they in fact were over any *particular* hypothesis according to which the data were the result of chance, **it does not always say that the data favors the more generic hypothesis that the data-generating mechanism is deterministic over the more generic hypothesis that it is genuinely chancy.** In the example under discussion, it says that $X=0$ favors $(\mu=0,\sigma=0)$ over any other $(\mu,\sigma)$ pair, but not that it favors $\sigma=0$ over $\sigma=\sigma_0$ for any $\sigma_0\neq 0$. The degree to which it favors $\sigma=0$ over $\sigma=\sigma_0$ is only given relative to a prior probability distribution over $\mu$ and thus is typically not available to a likelihoodist, who is not a Bayesian precisely because he or she wants to avoid appealing to prior probability distributions.

**Second, the Law of Likelihood is an account of evidential favoring and not of belief.** It does seem reasonable to say that $E$ favors $H_1$ over $H_2$ to a maximal degree if $H_1$ entails that $E$ has probability one and $H_2$ entails that it has probability zero. As much as we might want to be able to base our degrees of belief exclusively on facts about evidential favoring, it does not follow that one should believe $H_1$ over $H_2$ in light of $E$. For a theory about what one should believe in light of the data one needs to appeal to prior probabilities.

# A Bayesian Perspective

A Bayesian treatment of this example would involve putting a prior probability distribution over the $(\mu,\sigma)$ half-plane and using the likelihood function $p(X=0|\mu,\sigma)$ to update that distribution in accordance with Bayes’s theorem:

$$p(\mu,\sigma|X=0)\propto p(\mu,\sigma)p(X=0|\mu,\sigma)$$

One might either choose the prior probability distribution for $\mu$ and $\sigma$ that represents one’s beliefs about them prior to seeing $X=0$ or choose a distribution in accordance with a formal rule. For the sake of illustration, I will consider a prior probability distribution that is uniform for $\mu$ and has an inverse-gamma distribution with parameters $\alpha=\beta=4$ for $\sigma^2$, shown below:

This prior is “improper,” meaning that it is not a true probability distribution: because the component for $\mu$ is uniform over the entire real line, the joint distribution does not integrate to one. It can be thought of as the limit of proper prior probability distributions that express increasing degrees of indifference about $\mu$.

Updating this prior probability distribution in accordance with Bayes’s theorem involves multiplying it by the likelihood function and then renormalizing. Here is the resulting posterior probability distribution:

One can integrate $\mu$ out of the posterior probability distribution to find the posterior marginal probability distribution for $\sigma^2$. **The result is of great interest: it is the same as the prior marginal distribution.**

**In general, given a flat prior on $\mu$, learning the value of $X$ does not change a Bayesian’s degrees of belief about $\sigma$.** This result accords with the intuition that a single observation does not tell you anything about $\sigma$. At the same time, a Bayesian analysis highlights the fact that this intuition is too crude. If one is fairly certain that $\mu$ is large, for instance, then $X=0$ does favor a large value for $\sigma$ over a small one, because observations quite far from $\mu$ are more likely if $\sigma$ is large. Accordingly, learning the value of $X$ will change a Bayesian’s degrees of belief about $\sigma$ if his or her prior probability distribution on $\mu$ is not flat.
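The claim that a flat prior on $\mu$ leaves beliefs about $\sigma$ unchanged can be checked numerically: integrating the likelihood $p(X=0|\mu,\sigma)$ over $\mu$ with a flat prior gives 1 for *every* $\sigma$, so the $\sigma$-marginal of the posterior is proportional to the prior on $\sigma$ alone. A rough sketch using a trapezoid rule (function names are mine):

```python
from math import exp, pi, sqrt

def likelihood(mu, sigma, x=0.0):
    """Normal likelihood of the observation x at parameters (mu, sigma)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def integrate_over_mu(sigma, x=0.0, lo=-50.0, hi=50.0, n=20000):
    """Trapezoid-rule integral of the likelihood over mu (flat prior)."""
    h = (hi - lo) / n
    total = 0.5 * (likelihood(lo, sigma, x) + likelihood(hi, sigma, x))
    total += sum(likelihood(lo + k * h, sigma, x) for k in range(1, n))
    return total * h

# The integral is 1 regardless of sigma, so with a flat prior on mu the
# observation X = 0 does not shift the marginal distribution of sigma.
for sigma in (0.5, 1.0, 3.0):
    assert abs(integrate_over_mu(sigma) - 1.0) < 1e-6
```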

**A Bayesian analysis “fixes” the “problem” that the Law of Likelihood says that $X=0$ favors $(\mu=0,\sigma=0)$ over all other $(\mu,\sigma)$ pairs to an infinite degree by continuing to assign probability zero to $\mu$ and $\sigma$ both being exactly zero after seeing $X=0$.**

# A Frequentist Perspective

A frequentist seeks a method for drawing inferences or making decisions that has good objective long-run operating characteristics in repeated applications no matter what the truth may be. **He or she would not endorse the method of inferring $\mu=X$ and $\sigma=0$ regardless of $X$, because that method is very likely to lead to a false conclusion, unlike a naive likelihoodist who takes an infinite degree of evidential favoring for one hypothesis over each of uncountably many alternatives to warrant inferring that hypothesis.**

A frequentist would typically refuse to say anything about $\sigma$ given only a single observation. Unfortunately, he or she cannot say anything about $\mu$ either without assuming a particular value for $\sigma$. The standard frequentist method of testing a hypothesized value for the mean of a normal distribution with unknown standard deviation is a $t$-test, but that test requires at least two data points because it effectively uses a data-based estimate of the standard deviation to decide how much of a difference between the observed sample mean and the hypothesized mean to require in order to reject the hypothesized mean. Thus, a frequentist has a choice: either refuse to say anything at all in this example, or treat the standard deviation as known in testing a hypothesized value for $\mu$.

For the sake of illustration, let’s suppose that the frequentist decides to assume a standard deviation of one, perhaps on the basis of previous data from similar data-generating mechanisms. He or she would then need to specify a null hypothesis about $\mu$ to test. Frequentists generally consider either a “point null” hypothesis such as $\mu=0$ or a one-sided hypothesis such as $\mu\leq 0$. They choose a probability (often 5%) that they are willing to accept of rejecting the null hypothesis $H_0$ if it is true, and then seek the test that, subject to that constraint, maximizes the probability of rejecting $H_0$ if it is false.

Given a one-sided null hypothesis, this approach picks out a unique test. When the null hypothesis is $\mu\leq 0$ and the frequentist is willing to accept a 5% chance of rejecting that hypothesis if it is true, the test it picks out rejects the null hypothesis if and only if the observed value of $X$ is greater than 1.64.

Given a point null hypothesis, this approach fails to pick out a unique test: which test maximizes the probability of rejecting the null hypothesis if it is false for a given probability of rejecting it if it is true depends on how false the null hypothesis is and in what direction. The standard response to this problem is to impose the natural but somewhat *ad hoc* additional requirement that the test be symmetric about the null hypothesis. When the null hypothesis is $\mu=0$ and the frequentist is willing to accept a 5% chance of rejecting that hypothesis if it is true, for instance, this approach yields the test that rejects the null hypothesis if and only if the observed value of $|X|$ is greater than 1.96.
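The two cutoffs quoted above are just quantiles of the standard normal distribution (recall that $\sigma$ is assumed known and equal to one). A minimal check using the Python standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: sigma assumed known and equal to 1

# One-sided test of mu <= 0 at level 5%: reject when X exceeds the
# 95th percentile of the standard normal.
one_sided_cutoff = z.inv_cdf(0.95)

# Symmetric two-sided test of mu = 0 at level 5%: reject when |X| exceeds
# the 97.5th percentile, putting 2.5% in each tail.
two_sided_cutoff = z.inv_cdf(0.975)

assert abs(one_sided_cutoff - 1.64) < 0.01
assert abs(two_sided_cutoff - 1.96) < 0.01
```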

# Conclusion

There is of course much more to say about Bayesian, likelihoodist, and frequentist methods than I have been able to address in this short introductory series. For those who want to go deeper into these topics, the first chapter of Elliott Sober’s *Evidence and Evolution* would be a great next step. Royall (1997), Howson and Urbach (2006), and Mayo (1996) provide good contemporary defenses of likelihoodist, Bayesian, and frequentist methods, respectively.

To share your thoughts about this post, comment below or send me an email.


[^1]: The likelihood ratio here is a ratio of probability density functions rather than probabilities because the sample space is continuous. The use of continuous sample spaces raises some merely technical complications that we need not discuss here; see Hacking 1965 (57, 66-70), Berger and Wolpert 1988 (32-6), and Pawitan 2001 (23-4).
