**Abstract**

Frequentist statistical methods continue to predominate in many areas of science despite prominent calls for “statistical reform.” They do so in part because their main rivals, Bayesian methods, appeal to prior probability distributions that arguably lack an objective justification in typical cases. Some methodologists find a third approach called likelihoodism attractive because it avoids important objections to frequentism without appealing to prior probabilities. However, likelihoodist methods do not provide guidance for belief or action, but only assessments of data as evidence. I argue that there is no good way to use those assessments to guide beliefs or actions without appealing to prior probabilities, and that as a result likelihoodism is not a viable alternative to frequentism and Bayesianism for statistical reform efforts in science.

Neyman and Pearson (e.g. 1933) treat the problem of choosing the best rejection region for a simple-vs.-simple hypothesis test as what computer scientists call a 0/1 knapsack problem. **Standard examples of 0/1 knapsack problems are easier to grasp than hypothesis testing problems, so thinking about Neyman-Pearson test construction on analogy with those examples is helpful for developing intuitions. It is also illuminating to think about points of disanalogy between those scenarios and hypothesis testing scenarios, which give rise to possible objections to the Neyman-Pearson approach.**

In a knapsack problem, one seeks to **maximize some quantity subject to a constraint.** A standard example is that of a thief who wants to maximize the value of the objects she steals from a particular home, subject to the constraint that the total weight of those objects cannot be greater than the maximum weight that she can carry. For instance, suppose the thief has the following items to choose from.^{1}

| | Clock | Painting | Radio | Vase | Book | Computer |
|---|---|---|---|---|---|---|
| Value (USD) | 175 | 90 | 20 | 50 | 10 | 200 |
| Weight (lb.) | 10 | 9 | 4 | 2 | 1 | 20 |
| Value/Weight (USD/lb.) | 17.5 | 10 | 5 | 25 | 10 | 10 |

One possible approach to this problem is to **choose objects in order of descending value/weight ratio** until adding the next object would cause the total weight to exceed the limit. In this example, given enough space, that approach would lead one to choose first the vase, then the clock, then either the computer, the book, or the painting, and so on. **This approach has the following virtue: it yields a set of objects that has at least as much value as any other set with the same or less total weight.** However, there may be a set of objects that has greater value within the maximum weight limit. For instance, if the maximum weight is 10 lb., then this approach would lead one to take the vase only, because the next object, the clock, would put one over that weight limit. This choice provides more value than any other choice with the same total weight (2 lb.). However, there are other choices with total weight no greater than 10 lb. and greater value: for instance, one could take just the clock, or the vase, radio, and book.
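The gap between the ratio-ordered approach and the true optimum can be checked directly. The following is a minimal Python sketch using the values and weights from the table above; the function names `greedy_by_ratio` and `best_subset` are my own, and the exhaustive search is only practical because there are six items.

```python
from itertools import combinations

# Items from the table above: name -> (value in USD, weight in lb.)
ITEMS = {
    "clock": (175, 10),
    "painting": (90, 9),
    "radio": (20, 4),
    "vase": (50, 2),
    "book": (10, 1),
    "computer": (200, 20),
}

def greedy_by_ratio(items, max_weight):
    """Take items in descending value/weight order; stop at the first that does not fit."""
    order = sorted(items, key=lambda n: items[n][0] / items[n][1], reverse=True)
    chosen, total_weight = [], 0
    for name in order:
        value, weight = items[name]
        if total_weight + weight > max_weight:
            break
        chosen.append(name)
        total_weight += weight
    return chosen

def best_subset(items, max_weight):
    """Exhaustive search over all subsets for the highest value within the weight limit."""
    best, best_value = (), 0
    names = list(items)
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            weight = sum(items[n][1] for n in subset)
            value = sum(items[n][0] for n in subset)
            if weight <= max_weight and value > best_value:
                best, best_value = subset, value
    return list(best), best_value

greedy = greedy_by_ratio(ITEMS, 10)
greedy_value = sum(ITEMS[n][0] for n in greedy)
optimal, optimal_value = best_subset(ITEMS, 10)
print(greedy, greedy_value)    # ['vase'] 50
print(optimal, optimal_value)  # ['clock'] 175
```

With a 10 lb. limit the ratio-ordered rule stops after the vase, while the exhaustive search finds that the clock alone is worth more.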

In the 0/1 knapsack problem, each item is either in the knapsack or not. An easier problem is the **continuous knapsack problem**, in which objects can be arbitrarily broken up into smaller objects, preserving the ratios of their basic attributes. For instance, if the objects were things like gold bullion and crude oil, the thief might be able to take arbitrary quantities of those items at a fixed value/weight ratio. The optimal solution to the thief’s problem in this case would be to fill up on each item as much as possible in order of descending value/weight ratio, stopping precisely when the maximum weight is reached.
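As a rough illustration of the continuous version, the sketch below pretends (purely for the sake of the example) that the six items from the table could be divided at a fixed value/weight ratio. The function name `continuous_knapsack` is hypothetical.

```python
# Same values and weights as the table above, treated as divisible goods
# (an assumption made only for this illustration).
ITEMS = {
    "clock": (175, 10),
    "painting": (90, 9),
    "radio": (20, 4),
    "vase": (50, 2),
    "book": (10, 1),
    "computer": (200, 20),
}

def continuous_knapsack(items, max_weight):
    """Return {name: fraction taken}, filling up in descending value/weight order."""
    order = sorted(items, key=lambda n: items[n][0] / items[n][1], reverse=True)
    remaining = max_weight
    taken = {}
    for name in order:
        if remaining <= 0:
            break
        value, weight = items[name]
        fraction = min(1.0, remaining / weight)  # whole item, or whatever still fits
        taken[name] = fraction
        remaining -= fraction * weight
    return taken

taken = continuous_knapsack(ITEMS, 10)
total_value = sum(ITEMS[n][0] * f for n, f in taken.items())
print(taken, total_value)
```

With a 10 lb. limit this takes the whole vase and eight tenths of the clock, for a total value of 190, more than any whole-item selection can achieve at that weight.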

Suppose one wanted to test a null hypothesis $H_0$ against an alternative hypothesis $H_a$. In the simplest case, $H_0$ and $H_a$ are both “simple statistical hypotheses” relative to some proposed experiment, meaning that they each specify a particular chance distribution over the sample space **S** of possible outcomes of that experiment. **Our task is to decide which elements of S to place in the “rejection region” R**, that is, the precise set of results on which we will reject $H_0$ for $H_a$.

Neyman and Pearson propose to choose a test on the basis of *power* and *Type I error rate*, where a test’s power is the probability that it correctly rejects $H_0$ if $H_0$ is false, and its Type I error rate is the probability that it incorrectly rejects $H_0$ if $H_0$ is true. Specifically, they propose to choose a test that maximizes power subject to the constraint that the Type I error rate cannot exceed some maximum value $\alpha$. Thus, **they treat the problem of constructing a hypothesis test as a 0/1 knapsack problem**, completely analogous to the thief’s problem described above, as shown in this table.

| Literal Knapsack Problem | Hypothesis Test Construction |
|---|---|
| Putting item into knapsack | Putting element of **S** into rejection region **R** |
| Total value | Power (sum of $\Pr(s;H_a)$ over elements of **S** in **R**) |
| Total weight | Type I error rate (sum of $\Pr(s;H_0)$ over elements of **S** in **R**) |
| Maximizing total value subject to maximum total weight | Maximizing power subject to maximum Type I error rate |

Consider the example shown in the table below. $s_1$, $s_2$, and $s_3$ are elements of a sample space **S**. They could be, for instance, the event that a three-sided die produces a 1, 2, or 3, respectively. $H_0$ and $H_a$ would then be hypotheses about the biases of the die.

| | $s_1$ | $s_2$ | $s_3$ |
|---|---|---|---|
| $\Pr(s;H_a)$ | 0.04 | 0.05 | 0.91 |
| $\Pr(s;H_0)$ | 0.01 | 0.05 | 0.94 |
| $\Pr(s;H_a)/\Pr(s;H_0)$ | 4 | 1 | 0.97 |

I said above that putting objects into the knapsack in descending order by value/weight ratio, stopping when the next item would cause the total weight to exceed the limit, yields a set of items that has the largest value among all sets with no more than its total weight. Analogously, **putting elements of the sample space into the rejection region in order by descending likelihood ratio, stopping when the next item would cause the Type I error rate to exceed $\alpha$, yields a rejection region that has the greatest power among all possible rejection regions with no more than its Type I error rate.** (This result is known as the Neyman-Pearson lemma.) Just as that approach in the thief’s case may not yield the greatest possible value consistent with the cap on the total weight, so too in the hypothesis testing case **it may not yield the greatest possible power consistent with the Type I error rate being no greater than $\alpha$.** For instance, in the example shown above, it would lead one to perform a test that has power $.04$ when $.05\leq \alpha < .06$ (with **R**={$s_1$}) even though a test with power $.05$ and Type I error rate no greater than $\alpha$ is available (with **R**={$s_2$}).
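The same comparison can be run for the three-outcome example. This is a minimal Python sketch using the probabilities from the table above; `greedy_region` and `most_powerful_region` are names I have made up, and the small tolerance `1e-12` is there only to guard against floating-point rounding at the boundary.

```python
from itertools import combinations

# Probabilities from the table above.
P_HA = {"s1": 0.04, "s2": 0.05, "s3": 0.91}
P_H0 = {"s1": 0.01, "s2": 0.05, "s3": 0.94}

def greedy_region(alpha):
    """Add outcomes in descending likelihood-ratio order; stop at the first that overshoots alpha."""
    order = sorted(P_H0, key=lambda s: P_HA[s] / P_H0[s], reverse=True)
    region, size = [], 0.0
    for s in order:
        if size + P_H0[s] > alpha + 1e-12:
            break
        region.append(s)
        size += P_H0[s]
    return region

def most_powerful_region(alpha):
    """Exhaustive search for the highest-power region with Type I error rate <= alpha."""
    best, best_power = (), 0.0
    outcomes = list(P_H0)
    for r in range(len(outcomes) + 1):
        for region in combinations(outcomes, r):
            size = sum(P_H0[s] for s in region)
            power = sum(P_HA[s] for s in region)
            if size <= alpha + 1e-12 and power > best_power:
                best, best_power = region, power
    return list(best)

print(greedy_region(0.05))         # ['s1']
print(most_powerful_region(0.05))  # ['s2']
```

At $\alpha=.05$ the likelihood-ratio ordering stops at **R**={$s_1$}, whereas the unrestricted search picks **R**={$s_2$}.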

**There are two ways to turn the 0/1 knapsack problem of constructing a best Neyman-Pearson hypothesis test into a continuous knapsack problem.** First, one can consider cases with continuous, strictly positive probability distributions over continuous sample spaces. Here, the optimal solution is to add elements of the sample space to the rejection region in descending order by likelihood ratio until the Type I error rate reaches $\alpha$. Second, one can allow *randomized* tests that reject the null hypothesis with some non-extremal probability on some elements of the sample space. Here, the optimal solution is to add elements of the sample space to the rejection region in descending order by likelihood ratio until we get to the first element that would cause the Type I error rate to exceed $\alpha$ if we were to add it to **R**. We then prescribe consulting some auxiliary randomizer to decide whether to reject the null hypothesis if that result is observed, in such a way that the Type I error rate of the test is exactly $\alpha$. This procedure is analogous to having the thief take a portion of the item with the largest value/weight ratio that will not wholly fit in the bag, choosing the size of the portion so that the total weight is exactly the maximum weight.
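To make the randomized construction concrete, here is a worked version for the three-outcome example, taking $\alpha=.05$ as an illustrative choice and writing $\gamma$ (a symbol introduced here for convenience) for the probability of rejecting $H_0$ when $s_2$ is observed. $s_1$ goes into **R** outright, and $\gamma$ is chosen so that the Type I error rate is exactly $\alpha$:

$$
0.01 + \gamma \cdot 0.05 = 0.05 \quad\Longrightarrow\quad \gamma = 0.8, \qquad \text{power} = 0.04 + 0.8 \cdot 0.05 = 0.08 .
$$

At this $\alpha$ the randomized test therefore has more power than either nonrandomized candidate: **R**={$s_1$} has power $.04$ and **R**={$s_2$} has power $.05$.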

Randomized tests are often discussed in presentations of the Neyman-Pearson framework because they make certain results easier to state. However, they are generally rejected in practice. They violate the plausible principle that the output of a hypothesis test should depend only on aspects of the data that are evidentially relevant to the hypotheses in question. One could take the hardline view suggested by Neyman and Pearson’s own writings that this principle is false because only long-run error rates matter. However, few methodologists take this view so seriously that they are willing to countenance randomized tests.^{2}

**The Neyman-Pearson approach of treating hypothesis test construction as a knapsack problem has some odd consequences.** For instance, in the example above, the optimal solution for $.05\leq \alpha < .06$ rejects $H_0$ if and only if $s_2$ is observed. But $s_2$ has the same probability ($.05$) under $H_0$ and $H_a$, whereas $s_1$ is four times more probable under $H_a$ than under $H_0$. If one accepts the Law of Likelihood, which says that $s$ favors $H_a$ over $H_0$ if and only if $\Pr(s;H_a)/\Pr(s;H_0)>1$, then it follows that $s_1$ favors $H_a$ over $H_0$ while $s_2$ is neutral between them. Even if the Law of Likelihood is not acceptable in full generality, it seems to give a sensible verdict in this case. One might think, then, that $s_2$ should not appear in **R** without $s_1$.

**Pearson provides a solution to this problem in later papers.** In his (1947, 173), for instance, he prescribes a three-step process for specifying tests:

Step 1. We must specify the [sample space].

Step 2. We then divide this set by a system of ordered boundaries or contours such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.

Step 3. We then, if possible, associate with each contour level [a Type I error rate].

The key point here is that **Pearson prescribes ordering results according to the degree to which they would incline one to reject $H_0$ and only considering tests that reject $H_0$ on result $s$ but not result $s'$ if $s$ makes one more inclined to reject $H_0$ than $s'$.** If the ordering of one’s inclinations to reject $H_0$ on the basis of possible observations conforms to the likelihood ratios of those observations, then on this approach one will not consider problematic tests like the one that rejects on $s_2$ but not $s_1$ in our example. This approach is analogous to the thief always taking items in descending order by value/weight ratio, stopping when the next item will not fit. **Here we see a point of disanalogy between hypothesis test construction and literal knapsack problems:** because the outcomes of hypothesis tests are (at least *de facto*) interpreted in evidential terms, it seems inappropriate to add elements of the sample space to **R** “out of order” relative to their likelihood ratios, even if doing so allows one to get greater power while keeping the Type I error rate below $\alpha$. By contrast, it is not problematic to add items to the thief’s knapsack “out of order” relative to their value/weight ratios if doing so allows one to get a higher total value while keeping the total weight below the maximum.

**A second point of disanalogy between a literal knapsack problem and the problem of constructing a hypothesis test concerns the appropriate way to trade off value against weight, or power against Type I error rate.** We can arrange so that it is only a slight idealization to suppose that the thief does not care how heavy her bag is as long as she can carry it away. (We can assume that she has a strong back and a getaway vehicle nearby, does not have to worry about how much noise she makes, and so on.) We cannot arrange so that a scientist does not care about the Type I error rate of his or her test as long as it is below a particular threshold, at least if we impose the normative assumption that the scientist’s goal is to advance knowledge and not just, say, to get his or her paper past journal referees.

**Rather than maximizing power subject to a maximum Type I error rate, it would seem to make more sense to minimize a weighted sum of the Type I and Type II error rates,** where the Type II error rate is the probability of failing to reject the null hypothesis if it is false (1-power) and the weights reflect the importance of avoiding Type I and Type II errors. Like Pearson’s approach if one’s inclinations to reject $H_0$ conform to likelihood ratios, this approach would lead one to reject $H_0$ for $H_a$ if and only if the likelihood ratio $\Pr(s;H_a)/\Pr(s;H_0)$ exceeds some threshold $k$. In this case, $k$ is simply the weight one associates with the Type I error rate divided by the weight one associates with the Type II error rate. **The only difference between this approach and Pearson’s is that this approach involves fixing relative weights on the Type I error rate and power and letting the likelihood ratio cutoff for rejection and the Type I error rate fall where they may,** whereas Pearson’s involves putting a cap on the Type I error rate and letting the likelihood ratio cutoff and (implied) relative weights on the Type I error rate and power fall where they may.
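The claim about the threshold $k$ can be verified in a few lines for a discrete sample space. Write $w_1$ and $w_2$ (labels introduced here, not standard notation) for the weights on the Type I and Type II error rates. The weighted sum for a rejection region **R** is

$$
w_1 \sum_{s\in R}\Pr(s;H_0) + w_2\Bigl(1-\sum_{s\in R}\Pr(s;H_a)\Bigr)
= w_2 + \sum_{s\in R}\bigl[\,w_1\Pr(s;H_0) - w_2\Pr(s;H_a)\,\bigr].
$$

Each element of **R** contributes its own bracketed term, so the sum is minimized by including exactly those $s$ whose term is negative, that is, those with

$$
\frac{\Pr(s;H_a)}{\Pr(s;H_0)} > \frac{w_1}{w_2} = k .
$$

Elements whose term is exactly zero can be included or excluded without changing the weighted sum.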

One might object that weights on the Type I and Type II error rates are too subjective or arbitrary for use in science. However, they do not seem to be any more subjective or arbitrary than the maximum tolerated Type I error rate $\alpha$. There is a typical convention of setting $\alpha=.05$, but that convention is itself rather arbitrary. Moreover, we could establish the analogous convention of setting $k=20$, which has the effect of guaranteeing that the Type I error rate is no greater than the standard .05 (Royall 2000).
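The guarantee attached to $k=20$ follows from a short inequality, sketched here for a discrete sample space (integrals replace sums in the continuous case). If **R** contains only outcomes with $\Pr(s;H_a)/\Pr(s;H_0)\geq k$, then each such outcome satisfies $\Pr(s;H_0)\leq \Pr(s;H_a)/k$, and so

$$
\sum_{s\in R}\Pr(s;H_0) \;\leq\; \frac{1}{k}\sum_{s\in R}\Pr(s;H_a) \;\leq\; \frac{1}{k}.
$$

With $k=20$ the right-hand side is $.05$. The bound is typically not tight, so the actual Type I error rate of such a test can be well below $.05$.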

**To my mind, this alternative to Pearson’s approach seems more sensible.** Constructing a hypothesis test is more like filling a knapsack for a long journey than for a quick getaway: every increase in weight (Type I error rate) matters and needs to be compensated by a sufficient increase in value (power).

The main problem for this approach is that it does not generalize well to cases involving composite hypotheses (e.g., that a particular parameter is in a specified range or is not equal to a specific value), which are the usual cases in science. In those cases one or both of $\Pr(s;H_0)$ and $\Pr(s;H_a)$ lack definite values, and many methodologists are reluctant to appeal to the corresponding conditional probabilities $\Pr(s|H_0)$ and $\Pr(s|H_a)$ because they lack generally accepted objective values.

To share your thoughts about this post, comment below or send me an email.

Comments support $\LaTeX$ mathematical expressions: surround with single dollar signs for in-line math or double dollar signs for display math.

- This example comes from Lecture 8 of the edX course MITx 6.00.2x. ↩
- Lehmann is perhaps the most prominent contemporary methodologist who takes randomized tests seriously; see Lehmann and Romano 2005, Ch. 15. ↩

I’m organizing a group to work through the materials from Andrew Ng’s Stanford course CS 229: Machine Learning. This course is a more advanced version of the most popular course on Coursera, which is widely recommended as the best way to get started in machine learning.

I have set up a Slack instance for participants to ask each other questions, post solutions, etc. I will also organize a weekly Google hangout. (Participation in the hangout is optional.) We will start Nov. 30. There will be a suggested schedule, but you are welcome to work at your own pace. There is no harm in having people signed up but not actively involved, so feel free to join even if you are not sure how much you will participate.

As of right now ~~13~~ 21 people have signed up, so we should have a nice group. If you are interested, send me your email address so I can invite you to the Slack instance. If I don’t already know you, please also tell me who you are and why you are interested in the course.

*Neural network image at the top of this post is in the public domain.*

**Bayesians generally reject the frequentist view that inference and decision procedures should be sensitive to differences among “stopping rules”**—that is, the (possibly implicit) processes by which experimenters decide when to stop collecting the data that will be fed into those procedures—outside of unusual cases in which the stopping rule is “informative” in a technical sense.

Frequentists often argue for their position by claiming that **ignoring differences among noninformative stopping rules would allow experimenters to produce systematically misleading results**. For instance, Mayo and Kruse (2001) consider the case of a subject who claims to be able to predict draws from a deck of ESP cards. On a frequentist approach, if a 5% significance level is used and the data are treated as if the sample size had been fixed in advance, then the probability of rejecting the “null hypothesis” that the subject has no extrasensory abilities within the first 1000 observations if that hypothesis is true is 53%, and the probability of rejecting it within some finite number of observations is one. Accordingly, Mayo and Kruse claim that whether the experimenter had planned to stop after 1000 trials all along or had planned to stop as soon as a statistically significant outcome had occurred must be reported and must be taken into account in inference and decision.
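The inflation of the rejection rate under a “stop at the first significant result” rule is easy to see in simulation. The sketch below is a stand-in rather than a reconstruction of Mayo and Kruse’s card-guessing setup: it assumes standard normal observations with known variance, a two-sided $z$-test at the nominal 5% level (critical value 1.96) recomputed after every observation, and a cap of 1000 observations. The function names, the number of simulated runs, and the seed are arbitrary choices of mine.

```python
import math
import random

def rejects_within(max_n, z_crit, rng):
    """Simulate one experiment under H0; True if |z| ever exceeds z_crit by max_n observations."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)           # one new observation, H0 true
        if abs(total) / math.sqrt(n) > z_crit:  # z-statistic for the mean so far
            return True
    return False

def estimate_rejection_rate(runs=400, max_n=1000, z_crit=1.96, seed=0):
    """Monte Carlo estimate of the chance of rejecting H0 at some point within max_n."""
    rng = random.Random(seed)
    hits = sum(rejects_within(max_n, z_crit, rng) for _ in range(runs))
    return hits / runs

rate = estimate_rejection_rate()
print(rate)
```

Under these assumptions the estimate should land far above the nominal 5%, in the general neighborhood of the figure quoted above, with some Monte Carlo noise given the modest number of runs.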

Bayesians have responded to this argument by pointing out that **their approach does not allow such “reasoning to a foregone conclusion”** as long as the prior probability distributions that are used are countably additive (Kadane et al. 1996). In fact, the probability that a given experiment will produce a result that would lead a Bayesian agent to increase his or her odds for a particular hypothesis $H_a$ against a different hypothesis $H_0$ by a factor of $k$ is at most $1/k$ when $H_0$ is true, regardless of the experiment’s stopping rule.
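For two simple hypotheses, the factor by which a Bayesian’s odds change is just the likelihood ratio, so the $1/k$ bound can be probed by simulation. The following sketch assumes a toy setup of my own choosing: $H_0$ says the observations are normal with mean 0 and standard deviation 1, $H_a$ says the mean is 0.5, and the experimenter stops as soon as the likelihood ratio for $H_a$ reaches $k=8$ or 200 observations have been taken. All of those numbers, and the function name, are illustrative assumptions.

```python
import math
import random

def misleading_evidence_rate(k=8.0, max_n=200, delta=0.5, runs=2000, seed=1):
    """Fraction of H0-true runs in which the likelihood ratio for Ha ever reaches k."""
    rng = random.Random(seed)
    log_k = math.log(k)
    hits = 0
    for _ in range(runs):
        log_lr = 0.0
        for _ in range(max_n):
            x = rng.gauss(0.0, 1.0)  # data generated under H0 (mean 0)
            # log of the density ratio N(x; delta, 1) / N(x; 0, 1)
            log_lr += delta * x - delta ** 2 / 2
            if log_lr >= log_k:      # aggressive rule: stop at the first "success"
                hits += 1
                break
    return hits / runs

rate = misleading_evidence_rate()
print(rate, "bound:", 1 / 8)
```

Even with a stopping rule designed to chase a favorable likelihood ratio, the simulated frequency should stay below the bound of $1/k = .125$, up to sampling noise.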

Mayo and Kruse claim that **this response misses the point** of their objection, which does not require that the probability of being misled can be arbitrarily high, but only that it can be increased if stopping rules are ignored. Even on a fully Bayesian approach, disingenuous experimenters can tilt the odds in favor of their preferred hypothesis through their choice of stopping rule.

Bayesians often respond to this objection by saying that **the probability that an experiment will produce a misleading result is an issue of experimental design only** and is thus irrelevant to questions about inference or decision in light of the data.

There seems to be a good response to this Bayesian claim that I have not yet encountered: issues of inference or decision cannot be separated from issues of experimental design when choices regarding the former may influence choices regarding the latter and the interests of those making the two kinds of choices are not aligned. For instance, consider the position of a government regulatory agency such as the FDA. The FDA has reason to adopt more or less explicit and consistent inference or decision rules regarding, for instance, when to approve a drug. If the FDA foresees that a certain policy would lead pharmaceutical companies to choose stopping rules that the FDA regards as undesirable in order to tilt the odds of getting desired decisions in their favor, then that fact is a reason for them not to adopt that policy. **In this kind of case, issues of inference or decision and issues of experimental design are conceptually distinct but decision-theoretically entangled and thus cannot be treated separately.**

Consider a simplified case in which a scientist can perform either a test with a fixed sample size of $n$ or a test that will continue until either the likelihood ratio of $H_a$ against $H_0$ exceeds some number $l$ or some maximum sample size $m>n$ is reached. A regulator has to decide what likelihood ratio $l_f$ would suffice for rejecting $H_0$ if the fixed-sample experiment is performed and what likelihood ratio $l_t$ would suffice if the target-likelihood-ratio procedure were performed. If there are no concerns about the regulator’s choice influencing the experimental design, then the regulator should set $l_f=l_t$, in accordance with the fact that the difference between the noninformative stopping rules in question does not affect the evidential import of the data according to the Likelihood Principle and does not affect the posterior probabilities under Bayesian conditioning. However, if the scientist can take the regulator’s choices for $l_f$ and $l_t$ into account in designing his experiment and the regulator prefers to reject $H_0$ only if it is false while the scientist prefers for it to be rejected no matter what, then under typical circumstances the regulator maximizes her expected utility by setting $l_t>l_f$ to avoid incentivizing undesirable behavior by the scientist. (A demonstration of this result is available upon request.) Thus, under these circumstances **Bayesian principles entail that the regulator should act in accordance with the frequentist idea** that differences among noninformative stopping rules are relevant to inference or decision.

**This result is not an idle curiosity:** government regulators, scientific journal editors, science journalists, scientific societies, evidence-based practitioners, and even the general public can greatly affect decisions about experimental design through their choices of inferential and decision-making practices. **It is no objection to Bayesianism per se.** Bayesian arguments for the irrelevance of stopping rules to inference and decision all assume that issues of experimental design can be treated separately from issues of inference and decision, and that assumption breaks down in the kinds of cases in question. In fact, the result is actually useful for defending basic Bayesian principles because it shows that those principles recover frequentist intuitions about stopping rules in precisely the kinds of cases in which those intuitions are most plausible. On the other hand, the result indicates that


Zener cards image was created by Mikhail Ryazanov. It is used here under the Creative Commons Attribution-Share Alike 3.0 Unported license. Its use here does not imply that its creator endorses any of the positions taken here.

My conversations at the Munich Center for Mathematical Philosophy keep coming back to stopping rules, so I’ve decided to write a paper on the topic. Here is the general line that I plan to develop.

Abstract. One might think that a scientist should not be allowed to plan to end an experiment as soon as it produces data that favors his or her favored hypothesis. At the very least, a scientist who proceeds in such a “biased” way is obliged to report this fact and to account for it in his or her data analysis. I argue that these seeming platitudes are warranted only by contingent facts about the non-ideal ways in which data are typically disseminated and used. If we were perfect Bayesian decision-makers and data went missing only at random, for instance, then “biased” stopping rules would be unproblematic. In a less Utopian vein, attention to stopping rules becomes less important as institutional and technological advances allow us to approach those ideals in particular domains.

A key part of the argument is that **“biased” stopping rules always have tradeoffs**. For instance, there are simple strategies for increasing the probability of producing a result that favors your preferred hypothesis over a specified alternative to a particular degree (according to the Law of Likelihood). However, those strategies also *decrease* the probability of getting a result that *strongly* favors your preferred hypothesis and *increase* the probability of getting a result that *very strongly disfavors* it. There are general results (usually presented in the context of gambling strategies) that guarantee that something like this will always be the case. As a result, it is misleading to speak of stopping rules as biased or unbiased: one stopping rule can be more biased than another in a particular respect, but it must then be less biased in other respects. The details matter, but it is at least plausible that the existence of such tradeoffs is sufficient to address the main frequentist objection to ignoring stopping rules, which is that doing so would allow unscrupulous researchers to produce systematically misleading results.

**These tradeoffs are cold comfort in the presence of certain non-ideal practices concerning the use and dissemination of data.** For instance, if scientists are able to suppress results that do not support their preferred conclusions, and decisions to accept or reject one hypothesis relative to another are made once and for all on the basis of whether or not a threshold for evidential favoring is reached–rather than in a dynamical way that attends to the precise degree of evidential favoring–then ignoring stopping rules can be disastrous. Unfortunately, such selective reporting and threshold reasoning (e.g. $p<.05$) are ubiquitous in many areas of science.

The problem of selective reporting could be addressed to a large extent through the use of pretrial registries. The problem of once-and-for-all decision thresholds may be more difficult to eliminate, particularly in domains such as medical research in which decision-making power is largely delegated to authorities that are accountable to the public and thus have reason to proceed in a relatively transparent, stable, and “objective” manner. It could be less of a problem when the interests of the relevant parties are more aligned, such as in a business’s use of its own internal data.

My main claims are that (1) as a foundational matter, **the Likelihood Principle’s implication that stopping rules are irrelevant to the evidential import of data (provided that they are “noninformative” in a technical sense) is defensible,** and (2) **as a practical matter, attention to stopping rules may or may not be necessary in a given domain depending on how the data are disseminated and used.**


**Having regularly repeating routines allows you to stay on top of preventative maintenance and other kinds of tasks that are important but in danger of being neglected because they are never urgent.** It also gives you “slots” in which to insert new tasks that you want to start performing on a regular basis but wouldn’t otherwise know how to manage. Having a separate repeating event in a digital calendar for each task is another option, but it is difficult on that approach to keep the whole system up to date.

Combine regularly scheduled routines with a calendar for tasks that have specific deadlines and the GTD approach for tasks that you want to get to as soon as you’re able for a powerful approach to staying on top of everything you need to do. Routines can largely take the place of habit-forming systems such as Lift and are easier to maintain and more reliable. (But Lift is still useful for more situational habits that can’t be scheduled.) **Instead of trying to change yourself so that you are internally motivated to exercise every day, for instance, just put “exercise” on a list of things to do at a specific time each day.**

There are many possible ways to implement a set of regularly repeating routines. **My own approach is to keep a to-do list for each of these routines in Workflowy.** For each daily routine, I have a folder of favorites in my bookmarks bar in Google Chrome. At the beginning of the day, for instance, I middle-click on my Morning folder to open my Morning routine list and supporting materials such as my Google calendar in separate tabs. For the routines that repeat less often, it wouldn’t be worth the effort to keep favorites folders up to date. I just put links into those lists for supporting materials that I open in separate tabs as I go along. I do my weekly routines on Fridays and Saturdays and have repeating events in my Google calendar that prompt me to do the routines that repeat less often.

It’s important to review your routines regularly to keep them relevant, organized, and short. I review my daily routines as part of my weekly routine, my weekly routine as part of my monthly routine, and so on. Each time **I look for things that I can eliminate, automate, delegate, do less often, or batch together with similar tasks.**

Here are examples of items I include in my routines.

- Get a large glass of water
- Review my plan for the day (another Workflowy item) and set up my workspace for my first task
- Exercise
- Eat breakfast
- Brush teeth
- Shower
- Start work

- Make a plan for the next day
- Process high-volume inboxes (including email, feed reader, pocket notebook, and physical inbox). (This is a great way to get to Inbox Zero and stay there without constantly monitoring your messages. Note that “process” doesn’t necessarily mean “respond!” It’s OK to defer appropriate actions until later as long as you put an appropriate reminder somewhere in your task-management system.)
- Make calls, send texts and emails. (I add items to these lists as they occur to me throughout the day, then get them all done at once.)
- Review Anki flashcards
- Set up my workspace for my next work task
- Take a walk, grab a snack, and get back to work

- Record that day’s word count and other accomplishments
- Update plan for the next day
- Set alarms for the next day
- Unpack bags from today and pack for tomorrow
- Set up workspace for tomorrow

- Refine daily routines
- Deposit checks (with mobile app)
- Review last week’s transactions, update that month’s budget
- Do miscellaneous jobs around house (take out trash, etc.)
- Do GTD weekly review
- Process medium-volume inboxes (e.g. new research sources)
- Check job ads, update spreadsheet
- Check library records for upcoming due dates and available requests, plan library visit for that week if needed
- Look ahead to next week on my calendar

- Refine weekly routine
- Do computer maintenance (update anti-malware software, run scans, physical cleaning, etc.)
- Process lower-volume inboxes (e.g. items I have clipped to Evernote)
- Check tire tread and pressure, mileage for next oil change
- Check department calendar for upcoming events
- Clean up internet bookmarks, podcasts, etc.
- Check recent eTOC email alerts (I use a Gmail filter to archive and label these messages automatically)
- Look ahead to next month on annual planning calendar
- “80/20” analysis: which activities are driving most of my results? which could I eliminate at little cost?

- Refine monthly routine
- Do more computer maintenance (uninstall programs I haven’t been using, run Disk Utility, etc.)
- Replace toothbrushes
- Wash car
- Update Tools page
- Update social media profiles

- Refine quarterly routine
- Do an annual review
- Make an annual planning calendar for the next year
- Purge computer and physical files
- Schedule a physical
- Review insurance policies
- Change passwords
- Refine yearly routine for next year

To share your thoughts about this post, comment below or send me an email.

My goal in this series of posts is to provide a short, self-contained introduction to likelihoodist, Bayesian, and frequentist methods that is readily available online and accessible to someone with no special training who wants to know what all the fuss is about.

In the first post, I give a motivating example that illustrates the enormous costs of the failure of philosophers, statisticians, and scientists to reach consensus on a reasonable, workable approach to statistical inference. I then use a fictitious variant on that example to illustrate how likelihoodist, Bayesian, and frequentist methods work in a simple case.

In the second post, I use a strange example to illustrate how likelihoodist, Bayesian, and frequentist methods can come apart.

The second post is not ideal for pedagogical purposes because the example it uses is somewhat difficult to understand without special training. **This post is intended to illustrate some (though not all) of the same issues in a more accessible way.**

Suppose you were to take a single observation from a normally distributed random variable $X$ with unknown mean and standard deviation, yielding $X=0$. **What should you say about the mean and standard deviation of the distribution?**

For those who are not familiar with these terms, the claim that $X$ is normally distributed means (roughly) that it follows a bell-shaped curve. The mean of the curve gives the location of its peak, and the standard deviation tells how spread out the distribution is around that peak. The animation below shows how the probability distribution of $X$ varies with the mean $\mu$ and standard deviation $\sigma$.

The likelihoodist approach is based on the *Law of Likelihood*, which says that $X=0$ favors hypothesis $H_1$ over hypothesis $H_2$ if and only if their likelihood ratio $\mathcal{L}=p(X=0|H_1)/p(X=0|H_2)$^{1} is greater than 1, with $\mathcal{L}$ measuring the degree of favoring.

Let us fix the standard deviation at one, say, and consider what the Law of Likelihood says about hypotheses about the mean $\mu$. As one might expect, it says that $X=0$ favors $\mu=0$ over all other hypotheses of the form $\mu=\mu_0$ to a degree that increases with $|\mu_0|$. The degree to which $X=0$ favors $\mu=0$ over $\mu=\mu_0$ as a function of $\mu_0$ is shown below.
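As a minimal sketch of the calculation behind that plot (the function name `lr_mean` is my own, chosen for illustration), the likelihood ratio for $\mu=0$ against $\mu=\mu_0$ at $X=0$ with $\sigma=1$ can be computed directly from the normal density. It works out to $e^{\mu_0^2/2}$, which grows with $|\mu_0|$:

```python
import math
from statistics import NormalDist

def lr_mean(mu0, x=0.0, sigma=1.0):
    """Likelihood ratio for mu = 0 against mu = mu0, given one observation x."""
    return NormalDist(0.0, sigma).pdf(x) / NormalDist(mu0, sigma).pdf(x)

# Compare the numerical ratio with the closed form exp(mu0**2 / 2).
for mu0 in (0.5, 1.0, 2.0, 3.0):
    print(mu0, lr_mean(mu0), math.exp(mu0 ** 2 / 2))
```

The ratio depends only on $|\mu_0|$, so the plot is symmetric about zero.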

So far, so good. But now let’s fix the mean at zero, say, and consider what the Law of Likelihood says about hypotheses about the standard deviation $\sigma$. It says that $X=0$ favors $\sigma_1$ over $\sigma_2$ whenever the former is smaller than the latter, with the degree of favoring for a given value of $\sigma_2$ becoming unbounded as $\sigma_1$ goes to zero. The degree to which $X=0$ favors $\sigma=\sigma_0$ over $\sigma=1$ as a function of $\sigma_0$ is shown below (with $\mu$ assumed to be 0).
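The unbounded growth can be checked numerically. In this sketch (the name `lr_sigma` is hypothetical), the likelihood ratio for $\sigma=\sigma_0$ against $\sigma=1$ at $X=0$ with $\mu=0$ simplifies to $1/\sigma_0$, because the exponential factor of the normal density equals one when the observation sits exactly at the mean:

```python
from statistics import NormalDist

def lr_sigma(sigma0, x=0.0, mu=0.0):
    """Likelihood ratio for sigma = sigma0 against sigma = 1, with the mean fixed at mu."""
    return NormalDist(mu, sigma0).pdf(x) / NormalDist(mu, 1.0).pdf(x)

# The ratio grows like 1 / sigma0 as sigma0 shrinks toward zero.
for sigma0 in (1.0, 0.1, 0.01, 0.001):
    print(sigma0, lr_sigma(sigma0))
```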

**The fact that this plot rises without bound as $\sigma_0$ goes to zero strikes many as strange.** $\sigma$ quantifies the degree of *variation* we should expect in a sequence of observations. Intuitively, we cannot learn anything about variation from a *single* observation. Thus, one observation cannot possibly tell us anything about $\sigma$. The Law of Likelihood should say that $X=0$ does not favor any value of $\sigma$ over any other.

Notice, however, that the Law of Likelihood says that $X=0$ favors $\sigma=0$ over other values of $\sigma$ only *when the mean is fixed at 0*. In other words, it says that $X=0$ favors $(\mu=0,\sigma=0)$ over $(\mu=0,\sigma=\sigma_0)$ for all $\sigma_0\neq 0$. When the mean is fixed at some other value, the Law of Likelihood says that $X=0$ favors some other value of $\sigma$ over all others. The figure below gives the likelihood function over pairs of values for $\mu$ and $\sigma$. Its global maximum is at $(\mu=0,\sigma=0)$, but its maximum as a function of $\sigma$ varies with $\mu$.
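To illustrate how the best-supported value of $\sigma$ moves with $\mu$, here is a rough grid search (the function name, grid size, and upper bound are arbitrary choices of mine). Setting the derivative of the log-likelihood $-\log\sigma-\mu^2/(2\sigma^2)$ to zero suggests that, for a fixed nonzero $\mu$, the likelihood of $X=0$ is maximized at $\sigma=|\mu|$, and the search agrees to within the grid resolution:

```python
from statistics import NormalDist

def best_sigma(mu, x=0.0, grid_size=10_000, sigma_max=10.0):
    """Grid-search the sigma that maximizes the likelihood of x for a fixed mean mu."""
    grid = [sigma_max * (i + 1) / grid_size for i in range(grid_size)]
    return max(grid, key=lambda s: NormalDist(mu, s).pdf(x))

# For each fixed mean, the maximizing sigma should be close to |mu|.
for mu in (0.5, 1.0, 2.0):
    print(mu, best_sigma(mu))
```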

One might still find it problematic that the Law of Likelihood says not only that $X=0$ favors $(\mu=0,\sigma=0)$ over all other $(\mu,\sigma)$ pairs, but that it does so *to an infinite degree*. This result arises from the fact that if $\mu=0$ and $\sigma=0$, then $\Pr(X=0)=1$; otherwise, $\Pr(X=0)=0$. The Law of Likelihood interprets this fact as indicating that $X=0$ favors $(\mu=0,\sigma=0)$ over all other $(\mu,\sigma)$ pairs to an infinite degree.

More generally, **the Law of Likelihood will always say that the data favor a hypothesis which entails that the data were bound to be what they in fact were over a hypothesis according to which the data were the result of chance.** This issue arises in an extreme way in examples like this one in which the chance hypotheses are continuous and the data are sharp, so that the probability of the *exact* datum in question given any of the chance hypotheses is zero.

One consideration that mitigates this problem in practice is that data are never sharp: all real measuring devices have finite precision. However, this fact does not address the problem as a matter of principle, nor does it address the more general issue that the Law of Likelihood will always say that the data favor a hypothesis which entails that the data were bound to be what they in fact were over a hypothesis according to which the data were the result of chance.

Two considerations are more helpful in addressing worries arising from this fact about the Law of Likelihood. First, while the Law of Likelihood will always say that the data favor the *particular* hypothesis that the data were bound to be what they in fact were over any *particular* hypothesis according to which the data were the result of chance, **it does not always say that the data favor the more generic hypothesis that the data-generating mechanism is deterministic over the more generic hypothesis that it is genuinely chancy.** In the example under discussion, it says that $X=0$ favors $(\mu=0,\sigma=0)$ over any other $(\mu,\sigma)$ pair, but not that it favors $\sigma=0$ over $\sigma=\sigma_0$ for any $\sigma_0\neq 0$. The degree to which it favors $\sigma=0$ over $\sigma=\sigma_0$ is only given relative to a prior probability distribution over $\mu$ and thus is typically not available to a likelihoodist, who is not a Bayesian precisely because he or she wants to avoid appealing to prior probability distributions.

**Second, the Law of Likelihood is an account of evidential favoring and not of belief.** It does seem reasonable to say that $E$ favors $H_1$ over $H_2$ to a maximal degree if $H_1$ entails that $E$ has probability one and $H_2$ entails that it has probability zero. As much as we might want to be able to base our degrees of belief exclusively on facts about evidential favoring, it does not follow that one should believe $H_1$ over $H_2$ in light of $E$. For a theory about what one should believe in light of the data one needs to appeal to prior probabilities.

A Bayesian treatment of this example would involve putting a prior probability distribution over the $(\mu,\sigma)$ half-plane and using the likelihood function $p(X=0|\mu,\sigma)$ to update that distribution in accordance with Bayes’s theorem:

$$p(\mu,\sigma|X=0)\propto p(\mu,\sigma)p(X=0|\mu,\sigma)$$

One might either choose the prior probability distribution for $\mu$ and $\sigma$ that represents one’s beliefs about them prior to seeing $X=0$ or choose a distribution in accordance with a formal rule. For the sake of illustration, I will consider a prior probability distribution that is uniform for $\mu$ and has an inverse-gamma distribution with parameters $\alpha=\beta=4$ for $\sigma^2$, shown below:

This distribution is “improper,” meaning that it is not a true probability distribution because it does not integrate to one. It can be thought of as the limit of proper prior probability distributions that indicate increasing degrees of indifference about $\mu$.

Updating this prior probability distribution in accordance with Bayes’s theorem involves multiplying it by the likelihood function and then renormalizing. Here is the resulting posterior probability distribution:

One can integrate $\mu$ out of the posterior probability distribution to find the posterior marginal probability distribution for $\sigma^2$. **The result is of great interest: it is the same as the prior marginal distribution.**

**In general, given a flat prior on $\mu$, learning the value of $X$ does not change a Bayesian’s degrees of belief about $\sigma$.** This result accords with the intuition that a single observation does not tell you anything about $\sigma$. At the same time, a Bayesian analysis highlights the fact that this intuition is too crude. If one is fairly certain that $\mu$ is large, for instance, then $X=0$ does favor a large value for $\sigma$ over a small one, because observations quite far from $\mu$ are more likely if $\sigma$ is large. Accordingly, learning the value of $X$ will change a Bayesian’s degrees of belief about $\sigma$ if his or her prior probability distribution on $\mu$ is not flat.
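One way to see why the marginal distribution for $\sigma$ is unchanged: with a flat prior on $\mu$, the posterior marginal for $\sigma$ is proportional to the prior for $\sigma$ times $\int p(X=0|\mu,\sigma)\,d\mu$, and that integral equals one for every $\sigma$ because the normal density is symmetric in $x$ and $\mu$. The sketch below checks this numerically with a simple midpoint rule (the function name, integration range, and step count are my own assumptions, not part of the original analysis):

```python
from statistics import NormalDist

def likelihood_integrated_over_mu(sigma, x=0.0, half_width=60.0, steps=24_000):
    """Midpoint-rule approximation of the integral over mu of p(x | mu, sigma)."""
    density = NormalDist(0.0, sigma)  # p(x | mu, sigma) equals density.pdf(x - mu)
    h = 2 * half_width / steps
    mus = (x - half_width + (i + 0.5) * h for i in range(steps))
    return h * sum(density.pdf(x - mu) for mu in mus)

# The integral is (approximately) one whatever sigma is, so multiplying the
# prior by the likelihood and integrating out mu leaves the prior's shape in sigma.
for sigma in (0.5, 1.0, 5.0):
    print(sigma, likelihood_integrated_over_mu(sigma))
```

Replacing the flat prior on $\mu$ with an informative one makes the integral depend on $\sigma$, which is why the posterior for $\sigma$ then differs from the prior.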

**A Bayesian analysis “fixes” the “problem” that the Law of Likelihood says that $X=0$ favors $(\mu=0,\sigma=0)$ over all other $(\mu,\sigma)$ pairs to an infinite degree by continuing to assign probability zero to $\mu$ and $\sigma$ both being exactly zero after seeing $X=0$.**

A frequentist seeks a method for drawing inferences or making decisions that has good objective long-run operating characteristics in repeated applications no matter what the truth may be. **Unlike a naive likelihoodist who takes an infinite degree of evidential favoring for one hypothesis over each of uncountably many alternatives to warrant inferring that hypothesis, he or she would not endorse the method of inferring $\mu=X$ and $\sigma=0$ regardless of $X$, because that method is very likely to lead to a false conclusion.**

A frequentist would typically refuse to say anything about $\sigma$ given only a single observation. Unfortunately, he or she cannot say anything about $\mu$ either without assuming a particular value for $\sigma$. The standard frequentist method of testing a hypothesized value for the mean of a normal distribution with unknown standard deviation is a $t$-test, but that test requires at least two data points because it effectively uses a data-based estimate of the standard deviation to decide how much of a difference between the observed sample mean and the hypothesized mean to require in order to reject the hypothesized mean. Thus, a frequentist has a choice: either refuse to say anything at all in this example, or treat the standard deviation as known in testing a hypothesized value for $\mu$.

For the sake of illustration, let’s suppose that the frequentist decides to assume a standard deviation of one, perhaps on the basis of previous data from similar data-generating mechanisms. He or she would then need to specify a null hypothesis about $\mu$ to test. Frequentists generally consider either a “point null” hypothesis such as $\mu=0$ or a one-sided hypothesis such as $\mu\leq 0$. They choose the largest probability (often 5%) of rejecting the null hypothesis $H_0$ when it is true that they are willing to accept, and then seek the test that maximizes the probability of rejecting $H_0$ when it is false subject to that constraint.

Given a one-sided null hypothesis, this approach picks out a unique test. When the null hypothesis is $\mu\leq 0$ and the frequentist is willing to accept a 5% chance of rejecting that hypothesis if it is true, the test it picks out rejects the null hypothesis if and only if the observed value of $X$ is greater than 1.64.

Given a point null hypothesis, this approach fails to pick out a unique test: which test maximizes the probability of rejecting the null hypothesis if it is false for a given probability of rejecting it if it is true depends on how false the null hypothesis is and in what direction. The standard response to this problem is to impose the natural but somewhat *ad hoc* additional requirement that the test be symmetric about the null hypothesis. When the null hypothesis is $\mu=0$ and the frequentist is willing to accept a 5% chance of rejecting that hypothesis if it is true, for instance, this approach yields the test that rejects the null hypothesis if and only if the observed value of $|X|$ is greater than 1.96.
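The two cutoffs come from the quantiles of the standard normal distribution. As a small sketch (variable names are mine), Python's standard library can reproduce them for a 5% probability of rejecting a true null hypothesis:

```python
from statistics import NormalDist

standard_normal = NormalDist(0.0, 1.0)
alpha = 0.05  # accepted probability of rejecting the null hypothesis when it is true

# One-sided test of mu <= 0: reject when X exceeds the 95th percentile.
one_sided_cutoff = standard_normal.inv_cdf(1 - alpha)
# Symmetric two-sided test of mu = 0: reject when |X| exceeds the 97.5th percentile.
two_sided_cutoff = standard_normal.inv_cdf(1 - alpha / 2)

print(round(one_sided_cutoff, 2), round(two_sided_cutoff, 2))  # prints 1.64 1.96
```

The two-sided test splits the 5% evenly between the two tails, which is what the symmetry requirement amounts to in this case.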

There is of course much more to say about Bayesian, likelihoodist, and frequentist methods than I have been able to address in this short introductory series. For those who want to go deeper into these topics, the first chapter of Elliott Sober’s Evidence and Evolution would be a great next step. Royall (1997), Howson and Urbach (2006), and Mayo (1996) provide good contemporary defenses of likelihoodist, Bayesian, and frequentist methods, respectively.

To share your thoughts about this post, comment below or send me an email.

- The likelihood ratio here is a ratio of probability density functions rather than probabilities because the sample space is continuous. The use of continuous sample spaces raises some merely technical complications that we need not discuss here; see Hacking 1965 (57, 66-70); Berger and Wolpert 1988 (32-6); and Pawitan 2001 (23-4).

The major virtues and vices of Bayesian, frequentist, and likelihoodist approaches to statistical inference.

# Introduction

In the previous post, I gave a motivating example that illustrates the enormous costs of the failure of philosophers, statisticians, and scientists to reach consensus on a reasonable, workable approach to statistical inference. I then used a fictitious variant on that example to illustrate how likelihoodist, Bayesian, and frequentist methods work in a simple case. **In this post, I discuss a stranger case that better illustrates how likelihoodist, Bayesian, and frequentist methods come apart.** This post is considerably more technical than the previous one, and **I fear that those with no special training will find it tough going.** I would love to get feedback on how I can make it more accessible. For those who want to go deeper into these topics, the first chapter of Elliott Sober’s Evidence and Evolution would be a great next step. Royall (1997), Howson and Urbach (2006), and Mayo (1996) provide good contemporary defenses of likelihoodist, Bayesian, and frequentist methods, respectively.

Statistical inference is an attempt to evaluate a set of probabilistic hypotheses about the behavior of some data-generating mechanism. It is perhaps the most tractable and well-studied kind of inductive inference. The three leading approaches to statistical inference are Bayesian, likelihoodist, and frequentist. All three use likelihood functions, where the likelihood function for a datum $E$ on a set of hypotheses **H** is $\Pr(E|H)$ (the probability of $E$ given $H$) considered as a function of $H$ as it varies over the set **H**. However, they use likelihood functions in different ways and for different immediate purposes. Likelihoodists and Bayesians use them in ways that conform to the *Likelihood Principle*, according to which the evidential meaning of $E$ with respect to **H** depends only on the likelihood function of $E$ on **H**, while frequentists use them in ways that violate the Likelihood Principle (see Gandenberger 2014).

**Likelihoodists use likelihood functions to characterize data as evidence.** Their primary interpretive tool is the *Law of Likelihood*, which says that $E$ favors $H_1$ over $H_2$ if and only if their likelihood ratio $\mathcal{L}=\Pr(E|H_1)/\Pr(E|H_2)$ on $E$ is greater than 1, with $\mathcal{L}$ measuring the degree of favoring. Two major advantages of this approach are (1) it conforms to the Likelihood Principle and (2) it uses only the quantity $\mathcal{L}$, which is often objective because scientists often consider hypotheses that entail particular probability distributions over possible observations—for instance, the hypothesis that the mean of a normal distribution with a particular variance is zero. Even when the likelihood function is not objective, it is often easier to evaluate in a way that produces a fair degree of intersubjective agreement than the prior probabilities that Bayesians use. The great weakness of the likelihoodist approach is that it only yields a measure of evidential favoring, and not any immediate guidance about what one should believe or do.

**Bayesians use likelihood functions to update probability distributions in accordance with Bayes’s theorem.** Their approach fits nicely with the likelihoodist approach in that the ratio of the “posterior probabilities” (that is, the probabilities after updating on the evidence) $\Pr(H_1|E)/\Pr(H_2|E)$ on $E$ equals the ratio of the prior probabilities $\Pr(H_1)/\Pr(H_2)$ times the likelihood ratio $\mathcal{L}=\Pr(E|H_1)/\Pr(E|H_2)$. The Bayesian approach conforms to the Likelihood Principle, and unlike the likelihoodist approach it can be used directly to decide what to believe or do. Its great weakness is that using it requires supplying prior probabilities, which are generally based on either an individual’s subjective opinions or some objective but contentious formal rule that is intended to represent a neutral perspective.
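To make the relationship between posterior odds, prior odds, and the likelihood ratio concrete, here is a toy sketch with two exhaustive hypotheses. The prior probabilities and likelihoods are made-up numbers chosen only for illustration, and the function names are mine:

```python
def posterior_odds(prior_h1, prior_h2, lik_h1, lik_h2):
    """Posterior odds of H1 to H2: prior odds times the likelihood ratio."""
    return (prior_h1 / prior_h2) * (lik_h1 / lik_h2)

def posterior_probs(priors, likelihoods):
    """Bayes's theorem for a finite set of mutually exclusive, exhaustive hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical inputs: Pr(H1) = 0.2, Pr(H2) = 0.8, Pr(E|H1) = 0.9, Pr(E|H2) = 0.3.
post_h1, post_h2 = posterior_probs([0.2, 0.8], [0.9, 0.3])
print(post_h1 / post_h2, posterior_odds(0.2, 0.8, 0.9, 0.3))
```

Both routes give the same odds: the evidence favors $H_1$ by a factor of three, but the prior odds of one to four leave $H_1$ less probable than $H_2$ after updating.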

**Frequentists use likelihood functions to design experiments that are in some sense guaranteed to perform well in repeated applications in the long run, no matter what the truth may be.** Frequentist tests, for instance, control both the probability of rejecting the “null hypothesis” if it is true (often at the 5% level) and the probability of failing to reject it if it is false to a degree that one would hate to miss (often at the 20% level). They violate the Likelihood Principle, but they provide immediate guidance for belief or action without appealing to a prior probability distribution.

**Warning:** I am about to describe an example that is difficult to understand without some specialized training. If you get lost, you can skip to where it says “upshot,” which tells you everything you need to know for the rest of the post.

Suppose we were to take a series of observations from a normal distribution with unknown mean and known positive variance. In other words, suppose we were to take a series of observations at random from a population that follows a “bell-shaped curve,” and we know the size and shape of the curve but not the location of its center. Suppose further that instead of deciding in advance on a fixed number of observations to take, we decided to keep sampling until the average observed value $\bar{x}$ was a certain distance from zero, where that distance started at some constant $k$ times the square root of the variance and decreased at the rate $1/\sqrt{n}$ as the sample size $n$ increased.

Armitage (1961) pointed out that two things will happen in such an experiment:

- The experiment will end “almost surely” after a finite number of observations, no matter what the true mean may be. That is, the probability that the experiment goes on forever, with the mean of the observed values never getting far enough from zero to end the experiment, is zero. (It does not follow that it is *impossible* for the experiment to go on forever: it is *possible* to get an endless string of 0 observations, for instance. Hence the phrase “*almost* surely.”)
- When the experiment ends, the likelihood ratio for the hypothesis $H_{\bar{X}}$ that the true mean is the observed sample mean against the hypothesis $H_0$ that the true mean is zero on the observed data will be at least $e^{\frac{1}{2}k^2}$.
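The stopping rule and the resulting bound can be sketched in a short simulation. This is only an illustration under my own assumptions: the function names are hypothetical, the cap `max_n` is an artificial safeguard that the idealized experiment does not have, and I use a small $k$ because larger values can require an enormous number of observations. For normal data with known variance $\sigma^2$, the likelihood ratio for $H_{\bar{x}}$ against $H_0$ is $e^{n\bar{x}^2/(2\sigma^2)}$, so stopping when $|\bar{x}|\geq k\sigma/\sqrt{n}$ forces it to be at least $e^{k^2/2}$:

```python
import math
import random

def armitage_run(k, sigma=1.0, true_mean=0.0, max_n=100_000, rng=None):
    """Sample until |mean| >= k * sigma / sqrt(n); return (n, mean), or None if max_n is hit."""
    rng = rng or random.Random()
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(true_mean, sigma)
        mean = total / n
        if abs(mean) >= k * sigma / math.sqrt(n):
            return n, mean
    return None

def lr_sample_mean_vs_zero(n, mean, sigma=1.0):
    """Likelihood ratio for 'true mean = sample mean' against 'true mean = 0'."""
    return math.exp(n * mean ** 2 / (2 * sigma ** 2))

rng = random.Random(0)  # fixed seed so the illustration is repeatable
k = 1.0
for _ in range(5):
    result = armitage_run(k, rng=rng)  # data are generated with H_0 true
    if result is not None:
        n, mean = result
        print(n, mean, lr_sample_mean_vs_zero(n, mean), math.exp(k ** 2 / 2))
```

Every run that stops reports a likelihood ratio at or above the bound, even though the data were generated with the true mean equal to zero.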

**Caveat:** No one would ever run this experiment, and the average number of observations required to get a high degree of evidential favoring is enormous. Thus, one might be inclined to dismiss this example as irrelevant to statistical practice. It is nevertheless useful for illustrating and pressing on the principles that underlie Bayesian, frequentist, and likelihoodist approaches to statistical inference.

**Note:** Following standard notation in statistics, I use $\bar{X}$ to refer to the sample mean as a *random variable* and $\bar{x}$ to refer to the particular realized value of that random variable.

**This example looks bad for likelihoodists.** It shows that they are committed to the possibility of an experiment that has probability one of producing evidence that is as misleading as one likes with respect to the comparison between $H_{\bar{X}}$ and $H_0$. Frequentists avoid such possibilities: their primary aim is to control the probability that a given experiment will yield a misleading result. The great frequentist statistician David Cox went so far as to claim that “it might be argued that” this example “is enough to refute” the Likelihood Principle (2006).

**Let us not be too hasty, however.** The experiment has probability one of producing evidence that favors *some* hypothesis over $H_0$ to whatever degree one likes, even if $H_0$ is true. It does not have probability one of producing evidence that favors *any particular* hypothesis over $H_0$ to any particular degree. In fact, if $H_0$ is true, then the probability that *any* experiment produces evidence that favors any *particular* alternative hypothesis $H_a$ over $H_0$ to degree $k$ or more is at most $1/k$ (Royall 2000).
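The $1/k$ bound can be checked in a simple special case. The sketch below is my own illustration, not Royall's calculation: it assumes a single observation $X$ from a normal distribution with standard deviation one, a fixed alternative $\mu=a$ with $a>0$, and a true mean of zero. The likelihood ratio for $\mu=a$ over $\mu=0$ is then $e^{aX-a^2/2}$, which is at least $k$ exactly when $X\geq \ln(k)/a+a/2$:

```python
import math
from statistics import NormalDist

def prob_misleading(a, k):
    """P(LR for mu = a over mu = 0 is at least k) when mu = 0, for one draw from N(mu, 1)."""
    # LR = exp(a * X - a**2 / 2) >= k  iff  X >= log(k) / a + a / 2  (for a > 0)
    cutoff = math.log(k) / a + a / 2
    return 1 - NormalDist(0.0, 1.0).cdf(cutoff)

# For every fixed alternative a, the probability stays below 1 / k.
for k in (2, 8, 32):
    for a in (0.5, 1.0, 2.0, 3.0):
        print(k, a, prob_misleading(a, k), 1 / k)
```

In this normal case the probabilities fall well below $1/k$, so the universal bound is far from tight here.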

The fact that this experiment has probability one of producing evidence that favors *some* hypothesis over $H_0$ to *some* degree according to the Law of Likelihood even if $H_0$ is true is not a point against the Law of Likelihood. Even perfectly ordinary experiments do that, and it is clear that they do so not because the Law of Likelihood is wrong but because the evidence they produce is bound to be at least slightly misleading. Consider an experiment that involves taking a fixed number of observations from a normal distribution with unknown mean and known variance. The probability that the sample mean will be exactly equal to the mean of the distribution is zero, simply because the distribution is continuous. The Law of Likelihood will say that the evidence favors the hypothesis that the true mean equals the sample mean over the hypothesis that it equals zero even if it does in fact equal zero. But we are not inclined to reject the Law of Likelihood on those grounds: it seems to be correctly characterizing the evidential meaning of (probably only slightly) misleading data.

What makes the Armitage example apparently more problematic is that it has probability one of producing evidence that favors some hypothesis over $H_0$ *to whatever degree one likes*, even if $H_0$ is true. Thus, it seems to allow one to create not just misleading evidence, but *arbitrarily highly* misleading evidence at will, from the perspective of someone who accepts the Law of Likelihood.

But this gloss on what the example shows is selective and misleading. The evidence is arbitrarily misleading with respect to the comparison between the random hypothesis $H_{\bar{X}}$ and $H_0$, if $H_0$ is true. But it is not arbitrarily misleading with respect to the difference between the mean posited by the most favored hypothesis $H_{\bar{x}}$ and the true mean. In fact, **it merely trades off one dimension of misleadingness against another:** as one increases the degree to which the evidence is guaranteed to favor $H_{\bar{X}}$ over $H_0$, one thereby decreases the expected difference between the final sample mean $\bar{x}$ and the true mean of 0. In the absence of any principled way to weigh misleadingness along one dimension against misleadingness along the other, there is no principled argument for the claim (nor is it intuitively clear) that the Armitage example is any more misleading for those who accept the Law of Likelihood than the perfectly ordinary fixed-sample-size experiment that no one takes to refute the Law of Likelihood. Thus, it is at least unclear that the Armitage example refutes the Law of Likelihood either.

This example does, however, illustrate the point that it would be a mistake to adopt an unqualified rule of rejecting any hypothesis $H_0$ against any other hypothesis $H_1$ if and only if the degree to which one’s total evidence favors $H_1$ over $H_0$ exceeds some threshold. More generally, it does not seem to be possible to provide good norms of belief or action on the basis of likelihood functions alone, as I argue here. Relating likelihood functions to belief or action in a general way that conforms to the Likelihood Principle seems to require appealing to prior probabilities, as a Bayesian would do.

Armitage has provided a recipe for producing evidence with an arbitrarily large likelihood ratio $\Pr(E|H_{\bar{X}})/\Pr(E|H_0)$ even when $H_0$ is true. Bayesian updating on new evidence has the effect of multiplying the ratio of the probabilities for a pair of hypotheses by their likelihood ratio on that evidence. That is, in this case, $\Pr(H_{\bar{x}}|E)/\Pr(H_0|E)=\Pr(H_{\bar{x}})/\Pr(H_0)\times\Pr(E|H_{\bar{x}})/\Pr(E|H_0)$. Doesn’t the Armitage example thus provide a recipe for producing an arbitrarily large posterior probability ratio $\Pr(H_{\bar{X}}|E)/\Pr(H_0|E)$ on the Bayesian approach?

No. There are two problems. First, because the mean of the distribution is a continuous parameter, a Bayesian is likely to have credence zero in both the realized value of $H_{\bar{x}}$ and $H_0$. We should be dealing with probability *distributions* rather than discrete probability functions. (See previous post.) Second, the probability density at $H_{\bar{x}}$ varies with $\bar{x}$. Because proper probability distributions integrate to one, for any constant $c>0$ the ratio $p(H_{\bar{x}})/p(H_0)$ of the prior probability densities has to be less than $c$ for some $\bar{x}$, provided that $p(H_0)$ is not zero. Thus, **the Armitage example does not provide a recipe for producing an arbitrarily large ratio of posterior probability density values $p(H_{\bar{x}}|E)/p(H_0|E)$ on the Bayesian approach.**

**The Armitage example does not even provide a recipe for causing the probability that a Bayesian assigns to $H_0$ to decrease.** That probability will decrease if and only if the Bayesian likelihood ratio $p(\bar{x}|H_0)/p(\bar{x}|\neg H_0)$ is less than one. (This likelihood ratio is Bayesian because $p(\bar{x}|\neg H_0)$ depends on a prior probability distribution over the possible true mean values. It is a ratio of probability *densities* because the sample space is continuous. This fact raises some technical issues, but we need not worry about them here; see Hacking 1965, 57, 66-70; Berger and Wolpert 1988, 32-6; and Pawitan 2001, 23-4.) This result is not inevitable, and indeed is guaranteed to have probability less than one if $H_0$ is true. Moreover, the expected value of the reciprocal ratio $p(\bar{x}|\neg H_0)/p(\bar{x}|H_0)$, which measures the evidence against $H_0$, is guaranteed to be at most one if $H_0$ is true (Pawitan 2001, 239).

**The Armitage example does provide a recipe for causing the probability density ratio $p(H_{\mu_0})/p(H_0)$ to increase by any factor one likes** for *some* hypothesis $H_{\mu_0}$ positing a particular value $\mu_0$ other than 0 for the mean of the distribution, even if $H_0$ is true, provided that the probability density function is positive everywhere, but not for any *particular* value. **However, it is not clear that a Bayesian should be troubled by this result.** If he or she puts positive prior probability on $H_0$ and a continuous prior probability distribution everywhere else, then $p(H_{\mu_0})/\Pr(H_0)$ will remain zero. If he or she puts positive probability on $H_0$ and on some countable number of alternatives to $H_0$, then it is not inevitable that the result of the experiment will favor any of those alternatives over $H_0$. (The axioms of probability prohibit putting positive probability on an uncountable number of alternatives.) If he or she does not put positive probability on $H_0$, then he or she has no reason to be particularly concerned about the possibility of being misled with respect to $H_0$ and some alternative to it. See Basu (1975, 43-7) for further discussion.

The chief difference between frequentist treatments of the Armitage example, on the one hand, and Bayesian and likelihoodist treatments, on the other hand, is that **frequentists maintain that the fact that the experiment has a bizarre stopping rule and the fact that the hypothesis $H_{\bar{x}}$ was not designated for consideration independently of the data are relevant to what one can say about $H_{\bar{x}}$ in relation to $H_0$ in light of the experiment’s outcome.** Neither of those facts makes a difference to the likelihood function, so neither of them makes a difference to what one can say about $H_{\bar{x}}$ in relation to $H_0$ on a likelihoodist or Bayesian approach, or on any other approach that conforms to the Likelihood Principle. However, they do make a difference to long-run error rates with respect to $H_{\bar{X}}$ and $H_0$, and thus to what one can say about $H_{\bar{x}}$ in relation to $H_0$ on a frequentist approach that is designed to control long-run error rates.

**A frequentist would typically refuse to say anything about $H_{\bar{x}}$ in relation to $H_0$ in light of the outcome of an instance of the Armitage experiment.** He or she would insist that if one wanted to test $H_0$ against $H_{\bar{x}}$, then one would have to start over with a procedure that controlled long-run error rates with respect to those particular, fixed hypotheses. Some frequentists make some allowances for hypotheses that are not predesignated (e.g. Mayo 1996, Ch. 9), but they would never allow a procedure, such as one that says to reject $H_0$ in favor of $H_{\bar{x}}$ if and only if the likelihood ratio of the latter to the former exceeds some threshold, that has probability one of rejecting $H_0$ even if it is true. Violations of predesignation are permitted, if at all, only when the probability of erroneously rejecting the null hypothesis is kept suitably low.

A frequentist could draw conclusions about a *fixed* pair of hypotheses from an experiment with Armitage’s bizarre stopping rule. He or she would reject a fixed null hypothesis against a fixed alternative if and only if the likelihood ratio of the latter against the former exceeded some constant threshold chosen to keep the probability of rejecting the null hypothesis if it is true acceptably low. The likelihood ratio would depend not only on the observed sample mean, but also on the number of observations. Such a test is sensible from Bayesian and likelihoodist perspectives. In testing one point hypothesis against another, frequentists respect the Likelihood Principle within but not across experiments; they use likelihood-ratio cutoffs in the tests they sanction, but they allow their cutoffs to vary across experiments involving the same hypotheses in the same decision-theoretic context and do not allow any conclusions to be drawn at all when predesignation requirements are grossly violated.

There is something intuitively strange about the idea that facts about stopping rules and predesignation are relevant to what conclusions one would be warranted in drawing from an experimental outcome. It seems natural to think that the degree to which data warrant a conclusion is a relation between the data and the conclusion only. From a frequentist perspective, it also depends on what the intentions of the experimenters were regarding when to end the experiment and which hypotheses to consider. The dependency on stopping rules is particularly strange: it makes the conclusions one may draw from the data depend on *counterfactuals* about what the experimenters would have done if the data had been different. How could such counterfactuals about the experimenter’s behavior be relevant to the significance of the actual data for the hypotheses in question? (See Mayo 1996, Ch. 10 for a frequentist response to this objection.)

Some frequentists consider the strange example discussed here to be a counterexample to the Likelihood Principle. However, I have argued that likelihoodist and Bayesian treatments of it are defensible, whereas frequentist violations of the Likelihood Principle are problematic.

To share your thoughts about this post, comment below or send me an email. Comments support $\LaTeX$ mathematical expressions: surround with single dollar signs for in-line math or double dollar signs for display math.

]]>I have been recommending the first chapter of Elliott Sober’s Evidence and Evolution to those who ask for a good introduction to debates about statistical inference. That chapter is excellent, but it would be nice to be able to recommend something shorter that is readily available online. Here is my attempt to provide a suitable source. I presuppose some familiarity with probabilities but no formal training in probability theory.

Statistical inference is an attempt to evaluate a set of probabilistic hypotheses about the behavior of some data-generating mechanism. It is perhaps the most tractable and well-studied kind of inductive inference. The three leading approaches to statistical inference are Bayesian, likelihoodist, and frequentist. All three use likelihood functions, where the likelihood function for a datum $E$ on a set of hypotheses $\textbf{H}$ is $\Pr(E|H)$ (the probability of $E$ given $H$) considered as a function of $H$ as it varies over the set $\textbf{H}$. However, they use likelihood functions in different ways and for different immediate purposes. Likelihoodists and Bayesians use them in ways that conform to the *Likelihood Principle* (see Gandenberger 2014), according to which the evidential meaning of $E$ with respect to $\textbf{H}$ depends only on the likelihood function of $E$ on $\textbf{H}$, while frequentists use them in ways that violate the Likelihood Principle. Likelihoodists use likelihood functions to characterize data as evidence. Bayesians use them to update probability distributions. Frequentists use them to design experiments that are in some sense guaranteed to perform well in repeated applications in the long run.

I start with a real example that illustrates why these issues matter. I then discuss a fictitious simplified variant on that example to illustrate how the Bayesian, likelihoodist, and frequentist approaches work in typical cases. In my next post, I will discuss a stranger example that better illustrates how those approaches can come apart.

In the 1980s, infants showing a particular pattern of respiratory problems had about a 20% survival rate until a team of researchers led by Robert Bartlett developed a new therapy called ECMO (extracorporeal membrane oxygenation) that led to the survival of seventy-two of the first hundred patients on whom they tried it, the first fifty of whom had already failed to respond to conventional therapy.

Despite their early successes, conventional standards of scientific evidence required Bartlett et al. to perform a randomized clinical trial in which ECMO and conventional treatments were used side-by-side in the same clinical setting and patient population. Concerned about the ethics of continuing to use the seemingly inferior conventional treatment, Bartlett et al. used an innovative “randomized play-the-winner” trial design that adjusted the probability that a given patient would receive a given treatment as the trial went along so that the treatment that had performed the best in the trial so far would be favored. The result was that all eleven infants given ECMO survived, and the one given conventional therapy died.

This result too looked rather compelling given available background knowledge, but because only one patient received conventional therapy it did not meet the conventional standard for establishing the efficacy of a new treatment. As a result, Ware led a second randomized trial. He was also concerned about the ethics of continuing to use the seemingly inferior conventional treatment, so he designed his trial to have two phases: it would be randomized until four patients died on either treatment, and then it would continue using exclusively the other treatment. The result was that 28 out of 29 patients receiving ECMO survived, while 4 of 10 receiving conventional therapy died.

That result also looks convincing, but it too failed to meet the conventional standard for establishing the efficacy of a new treatment. As a result, a group of researchers in the UK carried out a third randomized trial. Not surprisingly, that trial had to be terminated when early results clearly indicated ECMO’s superiority, but not until fifty-four more infants had died under conventional therapy.

As the parent of a child who was hospitalized with severe respiratory problems in the first month of life, this story makes my blood boil. **It illustrates the enormous costs of the failure of philosophers, statisticians, and scientists to reach consensus on a reasonable, workable approach to statistical inference in science.**

The standard of evidence that led to this debacle was a frequentist one. However, the example does not provide a knockdown argument against frequentist approaches generally, but only against the rigid and simplistic way in which frequentist ideas were applied in this particular case. Frequentist methods of meta-analysis, for instance, could have been used to pool the results of the first two trials and to make a case against the need for a third trial. That being said, one great advantage that likelihoodist and Bayesian methods have over frequentist methods is that they make it much easier to combine data from disparate sources.

I will now present a fictitious variant on the example above to better illustrate how the likelihoodist, Bayesian, and frequentist approaches to statistical inference work. Suppose the prevailing survival rate on conventional therapy was 50% and that nine of the first twelve patients treated with ECMO had survived. **What would likelihoodists, Bayesians, and frequentists say about the proposition that the probability of survival on ECMO is greater than the prevailing rate?**

Likelihoodists use likelihood functions to characterize data as evidence. Their primary interpretive tool is the Law of Likelihood, which says that $E$ favors $H_1$ over $H_2$ if and only if $\mathcal{L}=\Pr(E|H_1)/\Pr(E|H_2)$ is greater than one, with $\mathcal{L}$ measuring the degree of favoring.

The Law of Likelihood does not apply in a straightforward way to the hypothesis that the chance of survival on ECMO is greater than 50%. That hypothesis is a *composite* statistical hypothesis; that is, it is a *disjunction* of many hypotheses that do not all assign the same probability to the observed experimental result. The probability that nine out of twelve patients survive given that the probability of a given patient surviving is $p$ is well-defined for each $p$, but not for the claim that $p$ is in some finite range.

We can use the Law of Likelihood to characterize the degree to which $E$ favors the hypothesis that the probability of survival is some particular number $p>$50% over the hypothesis that it is 50%. For instance, let “$H_p$” refer to the hypothesis that the probability that a given patient survives is $p$. Then according to the Law of Likelihood, the datum $E$ that nine out of twelve patients treated with ECMO survived favors the hypothesis $H_{75\%}$ that the probability of survival is 75% over the hypothesis $H_{50\%}$ that it is 50% to the degree $\Pr(E|H_{75\%})/\Pr(E|H_{50\%})=4.8$.
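The figure of 4.8 can be checked directly from the binomial probabilities. The following is a minimal Python sketch, assuming that the twelve patients are independent trials with a common survival probability; the function names are illustrative and not part of the original discussion.

```python
from math import comb

def binomial_likelihood(p, survivors=9, patients=12):
    """Pr(E | H_p): probability of exactly `survivors` survivals among `patients`."""
    deaths = patients - survivors
    return comb(patients, survivors) * p**survivors * (1 - p)**deaths

def likelihood_ratio(p1, p2):
    """Degree to which E favors H_p1 over H_p2 according to the Law of Likelihood."""
    return binomial_likelihood(p1) / binomial_likelihood(p2)

print(round(likelihood_ratio(0.75, 0.50), 1))  # 4.8
```

The binomial coefficient is the same in the numerator and the denominator, so it cancels; only the factors $p^9(1-p)^3$ matter for the ratio.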

Royall (2000, 761) suggests treating a likelihood ratio of 8 as the cutoff for declaring a piece of data to be “fairly strong evidence” favoring one hypothesis over another, and a likelihood ratio of 32 as the cutoff for “strong evidence.” By this standard, **$E$ favors $H_{75\%}$ over $H_{50\%}$, but not to a “strong” or “fairly strong” degree.**

One could also ask about the degree to which the evidence favors $H_p$ over $H_q$ for any pair of survival rates $p$ and $q$. For instance, the Law of Likelihood says that $E$ favors $H_{75\%}$ over $H_{20\%}$ to a very high degree indeed (4475).

The plot below shows the degree to which $E$ favors $H_p$ over $H_{50\%}$ as a function of $p$, according to the Law of Likelihood.
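The same curve can be tabulated numerically. This is a rough sketch under the binomial assumption used above; the grid spacing and the function name are arbitrary choices made for illustration.

```python
def likelihood_ratio_vs_half(p, survivors=9, patients=12):
    """Likelihood ratio of H_p against H_50% for `survivors` out of `patients`."""
    deaths = patients - survivors
    # The binomial coefficient cancels, leaving only the p-dependent factors.
    return (p**survivors * (1 - p)**deaths) / (0.5**patients)

# Evaluate the curve on a grid of candidate survival probabilities.
grid = [i / 100 for i in range(101)]
curve = [(p, likelihood_ratio_vs_half(p)) for p in grid]

# The ratio peaks at the observed survival frequency, 9/12.
best_p, best_ratio = max(curve, key=lambda pair: pair[1])

for p in (0.25, 0.50, 0.60, 0.75, 0.90):
    print(f"p = {p:.2f}  ratio = {likelihood_ratio_vs_half(p):.3f}")
print("maximum at p =", best_p)
```

Values of $p$ for which the ratio is below one are those that the data disfavor relative to $H_{50\%}$.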

From a likelihoodist perspective, there is no need to decide ahead of time which questions to ask, and it is completely legitimate to ask all of them simultaneously. This feature of the likelihoodist approach distinguishes it from the frequentist approach, as we will see below and in more detail next week.

Bayesians use likelihood functions to update probabilities rather than treating them as objects of interest in their own right. They contend that a rational agent has degrees of belief that conform to the axioms of probability, which he or she updates by conditioning. That is, if one learns the proposition $E$ with certainty and nothing else, then one should replace one’s prior degree of belief $\Pr(H)$ in any proposition $H$ with one’s prior degree of belief $\Pr(H|E)$ in $H$ conditional on $E$, which is given by Bayes’s theorem:

$$ \Pr(H|E)=\frac{\Pr(E|H)\Pr(H)}{\Pr(E|H)\Pr(H)+\Pr(E|\neg H)\Pr(\neg H)}$$

This updating rule has a nice connection with the Law of Likelihood: the posterior odds for a pair of hypotheses on this update rule is their prior odds times their likelihood ratio. That is,

$$\frac{\Pr(H_1|E)}{\Pr(H_2|E)}=\frac{\Pr(H_1)}{\Pr(H_2)}\frac{\Pr(E|H_1)}{\Pr(E|H_2)}$$

Now, a hypothesis like $H_{75\%}$ that posits that a continuous parameter (in this case, the chance of survival for an infant treated with ECMO) has a particular, sharp value will typically have prior probability zero. When considering such hypotheses, we need to use probability *densities*, which are the continuous analogues of discrete probability distributions. The probability that a continuous quantity is in any finite interval is given by the area under the probability density curve within that interval (or, equivalently, its integral over that interval). For instance, the figure below shows a reasonable prior probability density over the possible values of the parameter giving the chance of survival for someone who receives ECMO. The area of the blue region is the prior probability that the chance of survival is between 45% and 55%.

The equations above still hold when probabilities are replaced with probability densities. We can now see how the Bayesian approach would handle the example considered above. Using $p(H)$ rather than $\Pr(H)$ for probability densities, the continuous analogue of the odds equation tells us

$$\frac{p(H_{75\%}|E)}{p(H_{50\%}|E)}=\frac{p(H_{75\%})}{p(H_{50\%})}\frac{\Pr(E|H_{75\%})}{\Pr(E|H_{50\%})}=\frac{p(H_{75\%})}{p(H_{50\%})}\times 4.8$$

and

$$\frac{p(H_{75\%}|E)}{p(H_{20\%}|E)}=\frac{p(H_{75\%})}{p(H_{20\%})}\frac{\Pr(E|H_{75\%})}{\Pr(E|H_{20\%})}=\frac{p(H_{75\%})}{p(H_{20\%})}\times 4475$$

Suppose for the sake of illustration that one’s prior degrees of belief are appropriately represented by the figure above. Then one has

$$p(H_{20\%})=.77$$

$$p(H_{50\%})=1.4$$

$$p(H_{75\%})=1.3$$

and thus

$$\frac{p(H_{75\%})}{p(H_{50\%})}=.928$$

$$\frac{p(H_{75\%}|E)}{p(H_{50\%}|E)}=.928\times 4.8= 4.5$$

and

$$\frac{p(H_{75\%})}{p(H_{20\%})}=1.69$$

$$\frac{p(H_{75\%}|E)}{p(H_{20\%}|E)}=1.69\times 4475 \approx 7555$$
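The arithmetic in these equations can be reproduced in a few lines. This sketch takes the three prior density values quoted above as given and assumes the binomial model for the data; the names are illustrative only.

```python
# Prior density values read off the prior discussed in the text.
prior_density = {0.20: 0.77, 0.50: 1.4, 0.75: 1.3}

def likelihood(p, survivors=9, patients=12):
    """Pr(E | H_p) up to the binomial coefficient, which cancels in ratios."""
    return p**survivors * (1 - p)**(patients - survivors)

def posterior_ratio(p1, p2):
    """Posterior density ratio = prior density ratio times likelihood ratio."""
    prior_ratio = prior_density[p1] / prior_density[p2]
    return prior_ratio * likelihood(p1) / likelihood(p2)

print(round(posterior_ratio(0.75, 0.50), 2))
print(round(posterior_ratio(0.75, 0.20)))
```

Carrying unrounded intermediate values gives roughly 4.46 and 7556, which agree with the figures above up to rounding.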

The results for the entire posterior probability distribution are given by the orange curve in the figure below.

On a Bayesian approach, one can also assess the composite hypotheses that the chance of survival is greater than 50% and less than 50%, respectively. Again using the same probability distribution, one gets

$$\Pr(p<50\%)=.42$$

$$\Pr(p\geq 50\%)=.58$$

and

$$\Pr(p<50\%|E)=.05$$

$$\Pr(p\geq 50\%|E)=.95$$

**Thus, the data raise the probability that ECMO produces a survival rate higher than the prevailing rate on conventional therapy from $.58$ to $.95$, on the particular prior probability distribution used here.**
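Posterior probabilities for composite hypotheses like these can be approximated by numerical integration. The sketch below is only an illustration: it assumes a *uniform* prior density rather than the prior used above, so its prior probability for $p\geq 50\%$ is $.5$ rather than $.58$, and its output should not be expected to match the numbers above exactly. The function names and the number of integration steps are hypothetical choices.

```python
def uniform_prior(p):
    """Assumed prior density for illustration; not the prior used in the text."""
    return 1.0

def posterior_prob_at_least(threshold, prior, survivors=9, patients=12, steps=100_000):
    """Midpoint-rule estimate of Pr(p >= threshold | E)."""
    deaths = patients - survivors
    width = 1.0 / steps
    total = 0.0   # normalizing constant (up to the common factor `width`)
    above = 0.0   # unnormalized posterior mass at or above the threshold
    for i in range(steps):
        p = (i + 0.5) * width
        weight = prior(p) * p**survivors * (1 - p)**deaths
        total += weight
        if p >= threshold:
            above += weight
    return above / total

print(round(posterior_prob_at_least(0.5, uniform_prior), 3))
```

Under the uniform prior the posterior probability that $p\geq 50\%$ comes out at roughly $.95$. Passing a different density function as `prior` shows how strongly the answer depends on that choice.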

One great advantage of the Bayesian approach is that it tells one exactly to what degree one should believe a given hypothesis on the basis of a given piece of evidence. The great disadvantage is that it does so only relative to a given degree of belief in the hypothesis prior to receiving the evidence.

Frequentists generally reject the Law of Likelihood and the use of Bayesian probabilities in science. Their theory was originally developed and justified exclusively in terms of long-run error rates on decisions about how to behave with regard to hypotheses, rather than in terms of the degree to which the data support judgments about the alethic or epistemic value of particular hypotheses. As Neyman and Pearson put it in their original presentation of the frequentist approach, “without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not too often be wrong” (1933, 291). In practice, however, the outputs of frequentist methods are typically interpreted in terms of evidence and belief. The error-statistical philosophy developed primarily by Deborah Mayo is an ambitious attempt to develop and defend such interpretations.

A typical frequentist approach to the example under discussion would be to designate the hypothesis that ECMO is no more effective than conventional therapy (50% survival or less) the “null hypothesis” $H_0$ and to test it against the “alternative hypothesis” $H_a$ that ECMO is better than conventional therapy (greater than 50% survival). A trial would be designed to control both the probability of rejecting $H_0$ if it is true (called “the Type I error rate”) and the probability of failing to reject it if it is false to some degree that one would hate to miss (called “the Type II error rate”)—perhaps a 60% survival rate in this case. The usual approach to controlling these error rates is to choose the Type I error rate that one is willing to accept (often 5%); choose a trial design with maximum power (i.e., minimum Type II error rate) at that Type I error rate; and choose the sample size (in this case, the number of patients to treat) that makes the Type II error rate acceptably low (often 20%).
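The design recipe just described can be sketched for a one-sided exact binomial test. This is a simplified illustration rather than a full trial-design procedure: it assumes a fixed sample size, evaluates the Type I error rate at the boundary of the null hypothesis (50% survival), and evaluates power at a 60% survival rate. All function names and defaults are hypothetical.

```python
from math import comb

def pmf(n, k, p):
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_value(n, p_null=0.5, alpha=0.05):
    """Smallest k such that rejecting when X >= k keeps the Type I error rate <= alpha."""
    tail = 0.0
    k = n + 1  # rejecting only when X >= n + 1 means never rejecting
    while k > 0:
        next_tail = tail + pmf(n, k - 1, p_null)
        if next_tail > alpha:
            break
        tail = next_tail
        k -= 1
    return k

def power(n, p_alt=0.6, p_null=0.5, alpha=0.05):
    """Probability of rejecting the null hypothesis when the survival rate is p_alt."""
    k = critical_value(n, p_null, alpha)
    return sum(pmf(n, j, p_alt) for j in range(k, n + 1))

def smallest_adequate_n(target_power=0.8, max_n=500):
    """First sample size whose power reaches the target (Type II error rate <= 20%)."""
    for n in range(1, max_n + 1):
        if power(n) >= target_power:
            return n
    return None

print(critical_value(12))        # 10
print(smallest_adequate_n())
```

With only twelve patients the power against a 60% survival rate is under 10%, which is why a trial designed along these lines would call for a much larger sample.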

**On a frequentist approach, what one can conclude from the data depends greatly on the design of the experiment that produced the data.** Type I and Type II error rates are properties of repeatable procedures rather than of particular instances of those procedures. For this reason, frequentists generally proceed in accordance with protocols that they specify ahead of time. Otherwise, they would face often unanswerable questions about what repeatable procedure they were implementing.

An experimental protocol typically specifies both when the experimenters are to look at the data and what they are to conclude from various possible observations. If the trial protocol does not call for looking at the data after nine of the first twelve patients survived, then a frequentist cannot conclude anything from that datum. If it does call for looking at the data at that point, then what he or she can conclude may depend on when else the protocol would call for looking at the data.

For instance, suppose that the trial protocol calls for looking at the data once, after three patients have died. The most powerful test with Type I error rate no more than 5% rejects the null hypothesis in this case if and only if it takes twelve or more patients to reach three deaths. Thus, under this stopping rule, a frequentist could conclude from nine of the first twelve patients surviving that the new treatment is more effective than the old one.

On the other hand, suppose that the trial protocol calls for looking at the data once, after twelve patients have been treated. The most powerful test with Type I error rate no more than 5% rejects the null hypothesis in this case if and only if ten or more of those patients survive. Thus, under this stopping rule, a frequentist could *not* conclude from nine of the first twelve patients surviving that the new treatment is more effective than the old one.
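The two cutoffs can be checked by computing tail probabilities under the boundary null hypothesis of a 50% survival rate. The sketch below assumes independent patients; the function names are made up for illustration.

```python
from math import comb

ALPHA = 0.05
P_NULL = 0.5

def tail_fixed_sample(survivors, patients=12):
    """Pr(at least `survivors` of `patients` survive) when the survival rate is 50%."""
    count = sum(comb(patients, k) for k in range(survivors, patients + 1))
    return count * P_NULL**patients

def tail_stop_at_deaths(patients_needed, deaths=3):
    """Pr(it takes at least `patients_needed` patients to reach `deaths` deaths).

    Equivalent to seeing fewer than `deaths` deaths among the first
    `patients_needed - 1` patients.
    """
    earlier = patients_needed - 1
    count = sum(comb(earlier, d) for d in range(deaths))
    return count * P_NULL**earlier

# Same data (nine survivors, three deaths, twelve patients), two stopping rules.
print(tail_stop_at_deaths(12) <= ALPHA)  # True: reject under "stop at three deaths"
print(tail_fixed_sample(9) <= ALPHA)     # False: do not reject under "stop at twelve patients"
```

The tail probabilities are $67/2048\approx .033$ under the first stopping rule and $299/4096\approx .073$ under the second, which is why the same observations lead to different verdicts.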

Frequentists are not permitted to decide what questions to ask after looking at the data, except in limited ways that must be carefully prescribed ahead of time. Long-run error rates can be controlled with respect to particular questions, but not with respect to any and all questions simultaneously.

Frequentist methods’ sensitivity to “stopping rules” (i.e., the rules that tell the experimenters when to stop collecting data and draw conclusions) and to whether or not questions were predesignated is in violation of the Likelihood Principle: those factors have no effect on the likelihood function, and thus are according to the Likelihood Principle irrelevant to the evidential meaning of the data.
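To see why the stopping rule has no effect on the likelihood function in the example above, compare the probability of the observed data under the two protocols. The sketch below assumes the binomial model for the fixed-sample design and the negative binomial model for the stop-at-three-deaths design; the function names are illustrative.

```python
from math import comb, isclose

def likelihood_fixed_sample(p):
    """Pr(9 survivors among exactly 12 patients | survival probability p)."""
    return comb(12, 9) * p**9 * (1 - p)**3

def likelihood_stop_at_three_deaths(p):
    """Pr(third death occurs on patient 12 | survival probability p).

    The first eleven patients must include exactly two deaths,
    and the twelfth patient must die.
    """
    return comb(11, 2) * p**9 * (1 - p)**3

for p in (0.2, 0.5, 0.75):
    ratio = likelihood_fixed_sample(p) / likelihood_stop_at_three_deaths(p)
    print(p, round(ratio, 6))  # the ratio is the same constant (4.0) for every p

# Because the two functions differ only by a constant factor,
# every likelihood ratio between hypotheses is identical under both designs.
lr_fixed = likelihood_fixed_sample(0.75) / likelihood_fixed_sample(0.5)
lr_stop = likelihood_stop_at_three_deaths(0.75) / likelihood_stop_at_three_deaths(0.5)
print(isclose(lr_fixed, lr_stop))  # True
```

Both designs yield a likelihood proportional to $p^9(1-p)^3$, so likelihoodist and Bayesian conclusions are the same under either one, while the frequentist verdicts differ.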

Likelihoodist methods characterize the data as evidence with respect to pairs of simple statistical hypotheses. Bayesian methods use the data to update a probability distribution over the hypothesis space. Likelihoodist and Bayesian methods conform to the Likelihood Principle and fit together nicely.

Frequentist methods are rather different. Their creators regarded them not as providing assessments of the epistemic statuses of individual hypotheses, but instead as merely controlling long-run error rates. They violate the Likelihood Principle for the sake of controlling long-run error rates.

In my next post, I plan to discuss an example that brings out the differences among these approaches more clearly.


]]>