## Victor A. Albert

Print publication date: 2006

Print ISBN-13: 9780199297306

Published to Oxford Scholarship Online: September 2007

DOI: 10.1093/acprof:oso/9780199297306.001.0001



# Parsimony and its presuppositions

Chapter: Chapter 3, Parsimony and its presuppositions

Source: Parsimony, Phylogeny, and Genomics

Publisher: Oxford University Press

DOI: 10.1093/acprof:oso/9780199297306.003.0003

# Abstract and Keywords

The use of a principle of parsimony in phylogenetic inference is both widespread and controversial. It is controversial because biologists who view phylogenetic inference as first and foremost a statistical problem have pressed the question of what one must assume about the evolutionary process if one is entitled to use parsimony in this way. They suspect not just that parsimony makes assumptions about the evolutionary process, but that it makes highly specific assumptions that are often implausible. That it must make some assumptions seems clear to them because they are confident that the method of maximum parsimony must resemble the main statistical procedure used to make phylogenetic inferences: the method of maximum likelihood. Maximum likelihood requires an explicit probabilistic model of the evolutionary process; parsimony does not. Likelihoodists suspect that parsimony nonetheless involves an implicit model. The question for them is to discover what that model is. This chapter discusses parsimony's ostensible presuppositions by examining the relationship between maximum likelihood and maximum parsimony, with particular attention to simple examples in which parsimony and likelihood disagree.

# 3.1 Introduction

The use of a principle of parsimony in phylogenetic inference is both widespread and controversial. It is controversial because biologists who view phylogenetic inference as first and foremost a statistical problem have pressed the question of what one must assume about the evolutionary process if one is entitled to use parsimony in this way. They suspect not just that parsimony makes assumptions about the evolutionary process but that it makes highly specific assumptions that are often implausible. That it must make some assumptions seems clear to them because they are confident that the method of maximum parsimony must resemble the main statistical procedure that is used to make phylogenetic inferences, the method of maximum likelihood.1 Maximum likelihood requires the explicit statement of a probabilistic model of the evolutionary process. Parsimony does not; you can calculate how parsimonious different tree topologies are for a given data set without stating a process model. Likelihoodists suspect that parsimony nonetheless involves an implicit model. The question, for them, is to discover what that model is.

Cladists who defend the criterion of maximum parsimony often reply that parsimony does make assumptions about evolution, but that those assumptions are modest and unproblematic. For example, cladists sometimes claim that parsimony assumes just that descent with modification has occurred. This suggests that the disagreement between critics and defenders of parsimony is not about whether parsimony makes assumptions about the evolutionary process, but concerns what those assumptions are and whether they are troublesome. But perhaps more important is the fact that critics and defenders also disagree about how those assumptions should be unearthed and evaluated. Defenders of maximum likelihood approach this problem by embedding the principle of parsimony in a statistical framework; they evaluate parsimony by examining it through the lens of probability. Defenders of parsimony often reject the use of statistics and probability as a criterion for evaluating parsimony; as Farris (1983/1994, p. 342) says, “the modeling approach was wrong from the start.” His preferred alternative is to evaluate (and justify) parsimony in terms of what he takes to be the more basic idea that the best phylogenetic hypothesis is the one that has the most explanatory power.

There are many dimensions to this dispute, too many to discuss in the brief compass of the present chapter. What I wish to concentrate on here is the relationship that exists between maximum likelihood and maximum parsimony. Felsenstein (1973, 1979) and Tuffley and Steel (1997) have each identified models of the evolutionary process that suffice to ensure that the two methods always agree on which hypothesis is best supported by a given data set. I will argue that these results provide only negative guidance concerning what parsimony presupposes. They allow one to establish that this or that proposition is not assumed by parsimony, but do not allow one to conclude that any proposition is an assumption that parsimony makes. To discover what parsimony presupposes, another strategy is needed. I suggest that parsimony's presuppositions can be found by examining simple examples in which parsimony and likelihood disagree. My arguments will assume a broadly likelihoodist point of view, but will not require the assumption that any evolutionary model is correct.

# 3.2 Preliminaries

The principle of parsimony does not provide a rule of acceptance; rather, it provides a rule of evaluation. That is, the principle does not tell you to believe the phylogenetic hypothesis that requires the fewest changes in character state to explain the data at hand. After all, if the most-parsimonious tree requires that there be at least 25 changes, and the second and third most-parsimonious trees require that there be 26 and 27 respectively, the most you should conclude is that the most-parsimonious tree is better supported than the others; you are not obliged to conclude that the most parsimonious tree is true. In other words, parsimony would be a sound principle if the parsimony ordering of phylogenetic hypotheses and the support ordering of those hypotheses came to the same thing:

(1) For any data set D, and any phylogenetic hypotheses H1 and H2, D supports H1 more than D supports H2 if and only if H1 is a more parsimonious explanation of D than H2 is.

If (1) is always true, I will say that parsimony is ‘correct’ in what it says. And if (1) is true in some restricted domain, I will say that parsimony is correct in what it says about that domain.
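To make the "parsimony ordering" in (1) concrete, here is a minimal sketch that counts the minimum number of character-state changes each tree requires. It uses Fitch's counting procedure, which is a choice made for this illustration and is not specified in the chapter. The data are the single four-taxon site discussed in Section 3.3 (W and X in state G, Y and Z in state A); the function name `fitch_score` and the nested-tuple tree encoding are assumptions. Each unrooted tree is rooted arbitrarily, which does not affect the count.

```python
def fitch_score(tree, states):
    """Return (possible_root_states, minimum_changes) for a rooted binary tree.

    tree   -- a leaf name (str) or a pair (left, right) of subtrees
    states -- dict mapping leaf names to observed character states
    """
    if isinstance(tree, str):
        # A leaf contributes its observed state and no changes.
        return {states[tree]}, 0
    left_set, left_changes = fitch_score(tree[0], states)
    right_set, right_changes = fitch_score(tree[1], states)
    common = left_set & right_set
    if common:
        # The two children can share a state: no extra change is needed.
        return common, left_changes + right_changes
    # The children cannot agree: count one change, keep both options open.
    return left_set | right_set, left_changes + right_changes + 1


site = {"W": "G", "X": "G", "Y": "A", "Z": "A"}
topologies = {
    "(WX)(YZ)": (("W", "X"), ("Y", "Z")),
    "(WY)(XZ)": (("W", "Y"), ("X", "Z")),
    "(WZ)(XY)": (("W", "Z"), ("X", "Y")),
}
scores = {name: fitch_score(tree, site)[1] for name, tree in topologies.items()}
print(scores)  # {'(WX)(YZ)': 1, '(WY)(XZ)': 2, '(WZ)(XY)': 2}
```

No process model is stated anywhere in this calculation; the ordering comes from the change counts alone, with fewer changes counting as more parsimonious.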

Likelihood likewise seeks to provide a rule of evaluation, not a rule of acceptance. If one hypothesis confers a higher probability on the data than another hypothesis does, it does not follow that the first hypothesis is true; in fact, it doesn't even follow that the first has the higher probability of being true. The fact that Pr(Data | H1) > Pr(Data | H2) does not entail that Pr(H1 | Data) > Pr(H2 | Data). Rather, the virtue that has been claimed for the likelihood concept is that it provides an indication of which hypotheses are better supported by the data. The following principle has come to be called the Law of Likelihood (Hacking 1965, Edwards 1972, Royall 1997):
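The gap between likelihood and posterior probability can be seen in a short calculation. The numbers below are hypothetical, and the sketch assumes, for simplicity, that H1 and H2 are the only two hypotheses under consideration so that Bayes' theorem can be normalized over them.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a set of mutually exclusive, exhaustive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: value / total for h, value in joint.items()}


# Hypothetical values: H1 has the higher likelihood but the lower prior.
likelihoods = {"H1": 0.8, "H2": 0.4}   # Pr(Data | H)
priors = {"H1": 0.1, "H2": 0.9}        # Pr(H)

post = posterior(priors, likelihoods)  # Pr(H | Data)
print(post)
```

Here Pr(Data | H1) > Pr(Data | H2), yet Pr(H1 | Data) < Pr(H2 | Data), because the prior probabilities differ.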

(2) For any data set, and any phylogenetic hypotheses H1 and H2, the Data support H1 more than they support H2 if and only if Pr(Data | H1) > Pr(Data | H2).

Proposition (2) restricts likelihood to hypotheses that describe phylogenetic relationships. I state the Law of Likelihood in this way to preserve its symmetry with (1), even though likelihood is supposed to be a perfectly general criterion for evaluating the direction in which the evidence points. Proposition (2) is not a consequence of the axioms of probability; it is not a mathematical truth, but rather is a philosophical thesis—that the epistemological concept of support is adequately represented by the mathematical concept of likelihood. Just as one can ask whether, or in what circumstances, (1) is true, the same questions can be posed about (2). If (2) is always true, then I will say that likelihood is ‘correct’ in what it says. And if (2) is true in some restricted domain, I will say that likelihood is correct in what it says about that restricted domain.

How are (1) and (2) related? If there is a data set and a pair of hypotheses H1 and H2 such that the parsimony ordering and the likelihood ordering do not agree (e.g. where H1 is more parsimonious than H2, but Pr(Data | H1) < Pr(Data | H2)), then (1) and (2) cannot both be true. On the other hand, if parsimony and likelihood agreed about the relative support of any two hypotheses for any data set you please, then (1) and (2) would be perfectly compatible. The fact that one is stated in terms of likelihood and the other in terms of parsimony would be no more significant than the difference between measuring distance in meters and measuring it in feet.

Proposition (2) gives a somewhat misleading picture of what it means to apply the Law of Likelihood to phylogenetic hypotheses. The problem is that phylogenetic hypotheses that describe the topology of a tree (and not the times of branching events, or the amount of change that has taken place on branches, or the character states of interior nodes) do not, all by themselves, confer probabilities on the data. In the language of statistics, phylogenetic hypotheses are composite, not simple. There are two possible solutions to this problem. The first is Bayesian; one represents the likelihood of a phylogenetic hypothesis H as a weighted average. H will vary in its likelihood, depending on which process model is considered, and depending also on what the values are for the parameters that occur in a process model. The ‘full’ Bayesian approach is to take all these possibilities into account, weighting each by its probability, conditional on H (see note 2):

$$
\Pr(\text{Data} \mid H) = \sum_{i,j} \Pr(\text{Data} \mid H \,\&\, \text{process model } i \,\&\, \text{values } j \text{ for the parameters in model } i) \times \Pr(\text{process model } i \,\&\, \text{values } j \text{ for the parameters in model } i \mid H)
$$

Although biologists are starting to explore Bayesian methods in phylogenetic inference (see e.g. Huelsenbeck et al. 2001), no one has proposed to represent a hypothesis' likelihood by averaging over all possible process models;3 rather, Bayesians have tended to adopt a single process model M and to average over the different values that the parameters in M might take. According to this ‘attenuated’ Bayesian approach, the likelihood of H should be written as PrM(Data | H), not as Pr(Data | H), where

$$
\text{(B)}\quad \operatorname{Pr}_M(\text{Data} \mid H) = \sum_{j} \Pr(\text{Data} \mid H \,\&\, \text{model } M \,\&\, \text{values } j \text{ for the parameters in } M) \times \Pr(\text{values } j \text{ for the parameters in model } M \mid H \,\&\, M)
$$

Whereas Bayesians treat the likelihood of H (once a model has been adopted) as a weighted average over the likelihood that H would have under the different possible settings of the model's parameters, frequentists treat the likelihood of H (given an assumed model) by finding L(H&M), where L(H&M) is the likeliest special case of the conjunction (H&M); it is found by setting the adjustable parameters in M to values that maximize the likelihood of (H&M).4 For them, the appropriate quantity is
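
$$
\text{(F)}\quad \operatorname{Pr}_M(\text{Data} \mid H) = \max_{j} \Pr(\text{Data} \mid H \,\&\, \text{model } M \,\&\, \text{values } j \text{ for the parameters in } M)
$$
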
Whereas likelihood means average likelihood for a Bayesian, likelihood means best-case likelihood for a frequentist. We will return to this difference between Bayesian and frequentist treatments of likelihood in a moment. For now, let's focus on what they have in common—both evaluate the likelihood of H by assuming a process model M.5
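The difference between average likelihood and best-case likelihood can be illustrated with a toy calculation. Everything numerical here is hypothetical: a single model M whose one parameter takes three possible values, made-up likelihoods for two hypotheses at each value, and equal weights that are assumed to be the same under both hypotheses.

```python
# Hypothetical Pr(Data | H & M & parameter value j), one entry per value j.
likelihood_by_setting = {
    "H1": [0.90, 0.20, 0.10],
    "H2": [0.60, 0.50, 0.50],
}
# Hypothetical weights Pr(parameter value j | H & M), the same for both hypotheses.
weights = [1 / 3, 1 / 3, 1 / 3]


def average_likelihood(values, weights):
    """Bayesian treatment: weighted average over parameter settings."""
    return sum(value * weight for value, weight in zip(values, weights))


def best_case_likelihood(values):
    """Frequentist treatment: likelihood at the best parameter setting."""
    return max(values)


bayes = {h: average_likelihood(v, weights) for h, v in likelihood_by_setting.items()}
freq = {h: best_case_likelihood(v) for h, v in likelihood_by_setting.items()}
print("average:  ", bayes)
print("best case:", freq)
```

With these numbers the two treatments order the hypotheses differently: H1 has the higher best-case likelihood, while H2 has the higher average likelihood. Whether this happens in a real analysis depends on the model and the weights.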

How should this recognition of the model-relativity of likelihoods be incorporated in (2)? If different models are correct for different data sets and different taxa, there won't be a single ‘master model’ that should be used to evaluate the support of all phylogenetic hypotheses. Rather, what we need is the following:

(2*) If M is the correct model for how the characters described in a data set evolved in the taxa described in phylogenetic hypotheses H1 and H2, then the Data support H1 more than they support H2 if and only if PrM(Data | H1) > PrM(Data | H2).

Just as Proposition (2) is a philosophical thesis, not a mathematical truth, the same point holds for (2*). If (2*) is true, then I'll say that likelihoodM is correct for the taxa and data set in question.

I earlier described how (1) and (2) can come into conflict. What would it take for (1) and (2*) to conflict? You need the same ingredients as before, plus a model M that is correct for the taxa and characters involved. That is, consider a pair of phylogenetic hypotheses, a data set, and a model M, where the parsimony ordering of the hypotheses differs from their likelihoodM ordering. _If you accept (2*) and also think that model M is correct, then you are obliged to accept the judgment of likelihoodM and reject the judgment of parsimony concerning which hypothesis is better supported by the data._ Notice that there are two ifs in this italicized statement. This means that if you do not reject what parsimony says about the hypotheses, there are two options available, not just one. You can reject model M or you can reject (2*). That is, cladists are not obliged to reject model M; they also have the option of rejecting the Law of Likelihood as it is embodied in (2*).

So far I have described how a model can lead to a conflict between (1) and (2*). However, it is equally true that there are models of the evolutionary process that lead to a perfect harmony between (1) and (2*). Such models lead parsimony and likelihoodM to be ordinally equivalent:

(OE) Parsimony and likelihoodM are ordinally equivalent if and only if, for any data set D, and any pair of phylogenetic hypotheses, the parsimony ordering of that pair is the same as the likelihoodM ordering.

If a model M induces ordinal equivalence, what does that establish about the legitimacy of parsimony and likelihoodM? If you accept the model and regard one method as legitimate, then you should regard the other method as legitimate as well. In this circumstance, likelihoodists will say that M provides a likelihood justification of parsimony, whereas friends of parsimony will say that M provides a parsimony justification of likelihoodM. On the other hand, if you do not accept the model that induces ordinal equivalence, the status of the two methods is left open; for example, both could turn out to be unsatisfactory methods for evaluating the support of phylogenetic hypotheses. The point to notice here is that (OE) says nothing about whether parsimony is correct; it merely says what it means for parsimony and likelihoodM to be in the same boat; if parsimony and likelihoodM are ordinally equivalent, then both are correct or neither is. Two broken thermometers can be ordinally equivalent in what they say about the temperatures of different objects.
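Definition (OE) can be written as a simple check over tables of scores. This is a sketch under stated assumptions: the function name `ordinally_equivalent`, the table layout, and all of the numbers are hypothetical, and a finite table can only show agreement on the cases listed, whereas (OE) quantifies over every data set and every pair of hypotheses.

```python
from itertools import combinations


def sign(x):
    """Return +1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def ordinally_equivalent(changes, likelihood):
    """Check (OE) on the listed cases.

    changes[d][h]    -- number of changes h requires on data set d (fewer is more parsimonious)
    likelihood[d][h] -- likelihood of h on data set d under some model M (higher is better)
    """
    for d in changes:
        for h1, h2 in combinations(changes[d], 2):
            parsimony_order = sign(changes[d][h2] - changes[d][h1])         # +1: h1 more parsimonious
            likelihood_order = sign(likelihood[d][h1] - likelihood[d][h2])  # +1: h1 more likely
            if parsimony_order != likelihood_order:
                return False
    return True


changes = {"D1": {"H1": 1, "H2": 2, "H3": 2}}
agreeing = {"D1": {"H1": 1 / 16, "H2": 1 / 64, "H3": 1 / 64}}
conflicting = {"D1": {"H1": 0.01, "H2": 0.02, "H3": 0.02}}

print(ordinally_equivalent(changes, agreeing))     # True
print(ordinally_equivalent(changes, conflicting))  # False
```

The check compares orderings only. A result of `True` says that the two methods are in the same boat on these cases; it says nothing about whether either one evaluates support correctly.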

In summary, the model-relativity of likelihood entails that we are asking the wrong question when we ask “what is the relationship between likelihood and parsimony?” The word ‘the’ is where the trouble lies; there are many likelihood concepts (one for each possible model of the evolutionary process) and so there are many relationships between the different likelihood concepts and parsimony. More specifically, if we adopt (2*), the following two lines of reasoning are valid.

• If model M is correct, then likelihoodM correctly evaluates support. If likelihoodM has this property, and moreover is ordinally equivalent with parsimony, then parsimony also correctly evaluates support.

• If parsimony and likelihoodM are not ordinally equivalent and parsimony correctly evaluates support, then likelihoodM does not. If likelihoodM does not correctly evaluate support, then M cannot be correct.

The first line of reasoning describes what would be true if likelihoodM and parsimony were not just in perfect agreement, but additionally had the property of correctly evaluating support. The second describes how a failure of ordinal equivalence can help uncover a presupposition of parsimony—if parsimony correctly evaluates support, then process model M must be false. Notice that both lines of reasoning require (2*). A more thorough investigation would address the question of why one should accept this formulation of the Law of Likelihood. This is a topic I will not take up here; I'll assume (2*) without trying to justify it.

# 3.3 How to determine what parsimony does not presuppose

A number of writers have attempted to find models that induce ordinal equivalence. Two attempts have succeeded: those of Felsenstein (1973, 1979) and of Tuffley and Steel (1997). In Felsenstein's model, characters are constrained to have very low probabilities of changing state, but there is no requirement that the probability of a character's changing from state i to state j on a branch is the same as its probability of changing from state j to state i. In Tuffley and Steel's, characters can have high probabilities of changing state (though they need not), but the probabilities of change must be symmetrical.6 The models are very different, but each entails ordinal equivalence.

Both Felsenstein, and Tuffley and Steel, evaluate the likelihoods of phylogenetic hypotheses by using the frequentist approach (F) for assigning values to the parameters in the models they discuss. For example, consider a single site in the aligned sequences that characterize four species W, X, Y, and Z. Suppose that W and X are in state G and that Y and Z are in state A. The most parsimonious unrooted tree is (WX)(YZ). Under the symmetrical model that Tuffley and Steel assume, the highest likelihood this tree can have, relative to this character, is (¼)(¼)(1)(1)(1)(1) = 1/16, and this is the likelihood that Tuffley and Steel take the unrooted tree to have.7 A Bayesian would want to consider the average likelihood of (WX)(YZ), not the maximum.
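The figure of 1/16 can be checked numerically with a small grid search. The sketch below makes several assumptions that go beyond the text: four equiprobable states at one interior node, a symmetric model in which each branch has its own total change probability between 0 and 3/4 spread evenly over the three other states, and a coarse grid of candidate values. All function and variable names are hypothetical.

```python
from itertools import product

STATES = "ACGT"


def transition(p, i, j):
    """Symmetric model: total change probability p, split evenly over the other states."""
    return 1 - p if i == j else p / (len(STATES) - 1)


def site_likelihood(p_w, p_x, p_y, p_z, p_mid):
    """Likelihood of tree (WX)(YZ) for the site W=G, X=G, Y=A, Z=A.

    Sums over the unobserved states a (node joining W and X) and
    b (node joining Y and Z); a is treated as the root with probability 1/4.
    """
    total = 0.0
    for a, b in product(STATES, repeat=2):
        total += (0.25
                  * transition(p_w, a, "G") * transition(p_x, a, "G")
                  * transition(p_mid, a, b)
                  * transition(p_y, b, "A") * transition(p_z, b, "A"))
    return total


grid = [0.0, 0.25, 0.5, 0.75]  # candidate change probabilities for each branch
best = max(site_likelihood(*ps) for ps in product(grid, repeat=5))
print(best)  # 0.0625
```

The maximum is reached when the four terminal branches have change probability 0 and the interior branch has change probability 3/4, which reproduces the product (¼)(¼)(1)(1)(1)(1): one factor of ¼ for the state at the root and one for the change from G to A on the interior branch.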

There are other attempts in the literature to establish ordinal equivalence. Farris (1973) tried to prove this result by using a model that makes very weak assumptions about the evolutionary process. Goldman (1990) sought to do the same thing by using a model in which the probability that a character will change state on a branch is independent of the branch's duration. Both these efforts fail to establish ordinal equivalence because both interpret parsimony as inferring not just the topology of a tree but something more inclusive. Goldman viewed parsimony as a procedure for inferring the topology plus an assignment of character states to interior nodes; Farris took parsimony to output the topology plus an assignment of character states to all points along the branches. Why does this vitiate the arguments that Farris and Goldman present? The reason is that even if H1&X1 is more likely and more parsimonious than H2&X2, and H1 is more parsimonious than H2, it doesn't follow that H1 is more likely than H2 (Felsenstein 1973; Sober 1988; Steel and Penny 2000). The likelihood of a tree topology is obtained by summing over all possible assignments of character states to points in the tree's interior.
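A toy calculation shows why the most likely conjunction of a topology with an interior assignment need not belong to the most likely topology. All numbers are hypothetical, each topology is given only two possible interior assignments for brevity, and the weights Pr(X | H) are assumptions introduced so that the sum over assignments is a proper application of the law of total probability.

```python
# For each topology H: interior assignment X -> (Pr(Data | H & X), Pr(X | H)).
cases = {
    "H1": {"X1": (0.90, 0.10), "X1_alt": (0.05, 0.90)},
    "H2": {"X2": (0.60, 0.50), "X2_alt": (0.50, 0.50)},
}


def best_conjunction(assignments):
    """Highest likelihood attained by any single conjunction H & X."""
    return max(lik for lik, _ in assignments.values())


def topology_likelihood(assignments):
    """Pr(Data | H): sum over assignments X of Pr(Data | H & X) * Pr(X | H)."""
    return sum(lik * weight for lik, weight in assignments.values())


for name, assignments in cases.items():
    print(name, best_conjunction(assignments), topology_likelihood(assignments))
```

In this example H1&X1 is more likely than any conjunction involving H2, yet H2 has the higher likelihood once all interior assignments are taken into account.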

When a model induces ordinal equivalence, what does this reveal concerning parsimony's presuppositions about the evolutionary process? It most certainly does not show that parsimony assumes that the model is true. The models of Felsenstein (1973, 1979) and of Tuffley and Steel (1997) are each sufficient conditions for ordinal equivalence. No one has shown that either of these models is necessary for ordinal equivalence. And, obviously, neither of them is; if each of two models suffices, neither is necessary. We must be careful to distinguish what a modeler assumes from what the model reveals concerning what parsimony assumes (Sober 1988, 2004a).

Still, if we accept the instance of the Law of Likelihood given by (2*), these results about ordinal equivalence provide a partial test for whether parsimony assumes this or that proposition about the evolutionary process (Sober 2002, 2004a). As noted earlier, parsimony and likelihoodM can be ordinally equivalent even if both are wrong in what they say about support. However, if they are ordinally equivalent and model M is true, then both are correct, given (2*). Consider a model M that induces ordinal equivalence; M might be Felsenstein's model, or the one described by Tuffley and Steel, or some third model that no one has yet identified:

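
$$
\text{Model } M \;\Longrightarrow\; \text{parsimony is correct in what it says about support} \;\Longrightarrow\; A
$$
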
If model M is true (where M induces ordinal equivalence), then parsimony is correct in what it says about support (and so is likelihoodM, of course). What does parsimony assume? The assumptions (A) of parsimony are just those propositions that must be true, if parsimony is correct in what it says about support. Notice that any proposition that is entailed by the claim that parsimony is correct also must be entailed by model M. However, the converse is not true—if model M entails a proposition, that proposition may or may not be entailed by the thesis that parsimony is correct. This means that the results of Felsenstein (1973, 1979) and of Tuffley and Steel (1997) provide the following test concerning whether parsimony assumes that proposition X is true:
• If model M entails X, then X may or may not be an assumption of parsimony's.

• If model M does not entail X, then X is not an assumption of parsimony's.

Applying this partial test yields some surprising results. First, many biologists have suspected that parsimony assumes that changes in character state are very improbable and that homoplasies are rare; from a likelihood point of view, this suspicion is provably mistaken. The reason is that the Tuffley and Steel model does not entail that changes are improbable or that homoplasies are rare. Second, it follows that parsimony does not assume that change is symmetrical; the reason is that the Felsenstein model does not entail this. These results depend on using the Law of Likelihood (2*); but once that interpretative framework is adopted, these results are secure.

As illuminating as these results are, they still have the limitation of being purely negative. The partial test can show that this or that proposition is not an assumption that parsimony makes, but the test isn't able to demonstrate that a given proposition is assumed by parsimony. Results that demonstrate that a model induces ordinal equivalence have this inherent limitation. In order to obtain positive results concerning what parsimony assumes about the evolutionary process, a new strategy is needed.

# 3.4 How to determine what parsimony presupposes

Mathematical arguments for ordinal equivalence are necessarily general; they must show, for any data set and for any pair of phylogenetic hypotheses (which may describe an arbitrarily large number of taxa), that parsimony and likelihoodM agree about the support ordering. In contrast, an argument that demonstrates a failure of ordinal equivalence need not be general; it can just take the form of a simple example. All that is needed is a model M, a single data set, and a pair of hypotheses such that the parsimony ordering differs from the likelihoodM ordering. If parsimony is right in what it says, then likelihoodM is wrong. And if likelihoodM is wrong, so is the model M (assuming that (2*) is true). Such cases therefore help reveal parsimony's presuppositions.

## 3.4.1 Example 1

Let's begin with a simple example in which the hypotheses being evaluated don't describe tree topologies, but rather assign character states to ancestors in trees that are taken as given. Imagine a bifurcating tree in which all the tip species are observed to have the same character state a. Parsimony asserts that the best-supported estimate of the character state of the most recent common ancestor A of those tip species is that A was also in state a. Parsimony's solution to the problem would remain the same if we were talking about a star phylogeny. In fact, distilled to its simplest form, the problem and parsimony's solution to it can be formulated by considering a single lineage that ends with a descendant D that is in state a; the problem is to infer what the character state was of the ancestor A that existed at the start of the lineage. Parsimony says that the best estimate is that A = a.

What would a likelihood analysis of this problem look like? If the character in question is dichotomous (with character states 0,1), the standard approach from the theory of stochastic processes (Parzen 1962) is to divide the lineage into a large number of brief temporal intervals. In each, there is a probability u that the lineage will change from state 0 to state 1, and there is a (possibly different) probability v that the lineage will change from state 1 to state 0. Each of these instantaneous probabilities is assumed to be small (at least less than one-half). They allow us to describe the probability PrN(i→j) that a lineage that is N units of time in duration will end in state j, given that it starts in state i. These lineage transition probabilities are as follows:

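$$
\begin{aligned}
\operatorname{Pr}_N(0 \to 0) &= \frac{v}{u+v} + \frac{u}{u+v}\,(1-u-v)^N \\
\operatorname{Pr}_N(0 \to 1) &= \frac{u}{u+v} - \frac{u}{u+v}\,(1-u-v)^N \\
\operatorname{Pr}_N(1 \to 0) &= \frac{v}{u+v} - \frac{v}{u+v}\,(1-u-v)^N \\
\operatorname{Pr}_N(1 \to 1) &= \frac{u}{u+v} + \frac{v}{u+v}\,(1-u-v)^N
\end{aligned}
$$
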
This is the two-state Markov process model. Each of these transition probabilities averages over all possible scenarios consistent with the specified initial and end states. For this reason, it would be misleading to say that the two transition probabilities of the form PrN(i→i) describe the probability of stasis; PrN(i→i) encompasses the possibility that there has been no change in the lineage but also the possibility that the lineage has flip-flopped an even number of times. There is no assumption in this model as to whether u = v. If u = v, the lineage undergoes an unbiased process of drift. If u > v, there is a directionality or bias in the evolutionary process, favoring state 1 over state 0. One possible source of this bias is natural selection; however, mutation and migration also can induce a bias in how the lineage tends to evolve.
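As a sanity check on the two-state model, the sketch below computes the lineage transition probabilities in two ways: with the standard closed-form solution for a two-state Markov chain, and by stepping through the N brief intervals one at a time. The function names are hypothetical, and the code assumes u + v > 0.

```python
def transition_prob(i, j, u, v, n):
    """Closed-form Pr_N(i -> j); u = Pr(0 -> 1) and v = Pr(1 -> 0) per interval."""
    decay = (1 - u - v) ** n      # how much of the initial condition survives
    to_one = u / (u + v)          # long-run probability of state 1
    to_zero = v / (u + v)         # long-run probability of state 0
    if j == 1:
        return to_one + (to_zero if i == 1 else -to_one) * decay
    return to_zero + (to_one if i == 0 else -to_zero) * decay


def transition_prob_by_iteration(i, j, u, v, n):
    """Same quantity, obtained by applying the one-interval rule n times."""
    p_one = float(i)              # probability that the lineage is currently in state 1
    for _ in range(n):
        p_one = p_one * (1 - v) + (1 - p_one) * u
    return p_one if j == 1 else 1 - p_one


u, v, n = 0.02, 0.01, 50
for i in (0, 1):
    for j in (0, 1):
        print(i, j, transition_prob(i, j, u, v, n),
              transition_prob_by_iteration(i, j, u, v, n))
```

Because the iteration sums over every path from the initial state to the end state, agreement between the two functions illustrates the point that PrN(i→i) covers both unbroken stasis and an even number of flip-flops.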

When N is very small, the two probabilities of the form PrN(i→i) are close to unity and the two probabilities of the form PrN(i→j) (where i ≠ j) are close to 0. When N is infinite, PrN(i→j) = PrN(j→j); the lineage has the same probability of ending in state j, regardless of the state in which it began. Thus, when a lineage has a very short duration, its initial condition virtually determines its final state and the relationship of u and v doesn't matter; when a lineage is very old, it is the process that occurs during the lineage's duration (represented by u and v) that matters; the initial condition is forgotten.

It is a property of this model that a backwards inequality obtains: PrN(j→j) ≥ PrN(i→j), with strict inequality when N is finite (Sober 1988). Don't confuse the backwards inequality with the forwards inequality PrN(j→j) > PrN(j→i). An instance of this forwards inequality (e.g. that PrN(1→1) > PrN(1→0)) will be true for some values of u, v, and N, but not for others. The backwards inequality says that if a descendant is in state j, that outcome is made more probable by the hypothesis that its ancestor was in state j than by the hypothesis that the ancestor was in state i. This provides a likelihood solution to our problem: if the descendant is in state a of a dichotomous character, the hypothesis of maximum likelihood about the state of the ancestor is that the ancestor was in state a as well. This result holds regardless of what the values of u, v, and N are; even if these values entail that the expected number of changes in the lineage is large, the most-parsimonious assignment of character state to the ancestor is still the assignment of maximum likelihood. Parsimony and likelihoodM therefore agree when M is the two-state Markov process model and the problem is to infer an ancestor's character state from the character state of a descendant.8
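The contrast between the backwards and forwards inequalities can be checked numerically. The sketch below is illustrative only: it uses the standard closed form for a two-state chain, a small grid of parameter values chosen arbitrarily (all below 0.5), and hypothetical function names.

```python
from itertools import product


def transition_prob(i, j, u, v, n):
    """Pr_N(i -> j) for the two-state chain; u = Pr(0 -> 1), v = Pr(1 -> 0)."""
    decay = (1 - u - v) ** n
    limit = u / (u + v) if j == 1 else v / (u + v)   # long-run probability of state j
    return limit + ((1 - limit) if i == j else -limit) * decay


def backwards_gap(j, u, v, n):
    """Pr_N(j -> j) - Pr_N(i -> j), where i is the other state."""
    return transition_prob(j, j, u, v, n) - transition_prob(1 - j, j, u, v, n)


grid = list(product([0.01, 0.05, 0.1, 0.2], [0.01, 0.05, 0.1, 0.2], [1, 5, 20]))

# Backwards inequality: holds at every grid point, for both end states.
backwards_holds = all(backwards_gap(j, u, v, n) > 0 for u, v, n in grid for j in (0, 1))

# Forwards inequality Pr_N(1 -> 1) > Pr_N(1 -> 0): true for some settings only.
forwards = [transition_prob(1, 1, u, v, n) > transition_prob(1, 0, u, v, n)
            for u, v, n in grid]

print(backwards_holds, any(forwards), all(forwards))
```

In this model the backwards gap equals (1 − u − v)^N, which is positive whenever u and v are each below 0.5 and N is finite, and shrinks towards zero as N grows.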

In analyzing this simple problem, I used the Bayesian method (B), not the frequentist procedure (F), for taking account of the fact that the two-state Markov model can have different values assigned to its parameters u, v, and N. I didn't focus exclusively on the values for these parameters that would maximize the likelihood of each hypothesis about the state of ancestor A; that would have led to the conclusion that the two likelihoods are as close together as you please, since

$$
\sup_{u,v,N} \Pr(D=1 \mid A=1 \,\&\, u, v, N) = 1
$$

and

$$
\sup_{u,v,N} \Pr(D=1 \mid A=0 \,\&\, u, v, N) = 1.
$$

Rather, my argument is that for each value of u, each value of v (each less than 0.5), and for each finite value of N,
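
$$
\Pr(D=1 \mid A=1 \,\&\, u, v, N) > \Pr(D=1 \mid A=0 \,\&\, u, v, N).
$$

From this it follows that

$$
\sum_{i,j,k} \Pr(D=1 \mid A=1 \,\&\, u=i \,\&\, v=j \,\&\, N=k)\,\Pr(u=i \,\&\, v=j \,\&\, N=k \mid A=1)
$$

$$
> \sum_{i,j,k} \Pr(D=1 \mid A=0 \,\&\, u=i \,\&\, v=j \,\&\, N=k)\,\Pr(u=i \,\&\, v=j \,\&\, N=k \mid A=0)
$$
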
if the settings of u, v, and N are independent of the character state of the ancestor A. In this instance, the Bayesian weighting terms Pr(u=i & v=j & N = k | A=1) and Pr(u=i & v=j & N = k | A=0) are innocuous; the last stated inequality holds, no matter what their values are.
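The claim that the weighting terms are innocuous can be illustrated by drawing arbitrary weights and comparing the two weighted averages. This is a sketch under stated assumptions: a small hypothetical grid of settings for u, v, and N, weights generated at random, and the same weights used under A=1 and A=0, which is what independence of the settings from the ancestor's state amounts to.

```python
import random
from itertools import product


def pr_descendant_is_1(ancestor, u, v, n):
    """Pr(D=1 | A=ancestor & u, v, N) in the two-state Markov model."""
    decay = (1 - u - v) ** n
    if ancestor == 1:
        return u / (u + v) + v / (u + v) * decay
    return u / (u + v) - u / (u + v) * decay


settings = list(product([0.01, 0.1, 0.3], [0.01, 0.1, 0.3], [1, 10, 30]))

rng = random.Random(0)                    # fixed seed so the run is repeatable
raw = [rng.random() for _ in settings]
weights = [r / sum(raw) for r in raw]     # arbitrary weights Pr(u, v, N) summing to 1

average_a1 = sum(w * pr_descendant_is_1(1, u, v, n)
                 for w, (u, v, n) in zip(weights, settings))
average_a0 = sum(w * pr_descendant_is_1(0, u, v, n)
                 for w, (u, v, n) in zip(weights, settings))
print(average_a1 > average_a0)
```

Since the inequality holds setting by setting and the two sums share the same nonnegative weights, the ordering of the averages does not depend on which weights are drawn. If the weights were allowed to differ between A=1 and A=0, this argument would no longer go through.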

I now want to consider the same problem (that of inferring the character state of the ancestor A from the observed character state of the descendant D) when the character in question is a quantitative phenotype (e.g. the average length in the species of a particular bone), not dichotomous. It remains true, of course, that if the descendant is in state a, then the most-parsimonious hypothesis about the state of the ancestor is that it was in state a as well. To see what likelihood says about this problem, we need to construct a probabilistic model of the evolution of the quantitative character. Let's begin by setting limits on the values of the character in question; suppose it can't go below zero or above 100. We can think of u as the probability of the lineage's increasing its character state by a very small amount during a brief interval of time, and v as the probability of the lineage's reducing its value during that instant. Since there are upper and lower bounds on the character state, u and v cannot remain constant over the full range of the lineage's possible states; for example, u must have a value of zero when the lineage is in state 100, though of course it can have a nonzero value when the lineage has a value less than 100.9 In addition, we want to allow for the possibility that the lineage is evolving towards a stable equilibrium; for example, perhaps a trait value of 75 is optimal, and selection is pushing the lineage towards that value. This means that u > v when the lineage's trait value is less than 75, but that u < v when the population has a value greater than 75. In addition, the degree to which u > v must decline as the population approaches 75 from below.

When a biased process (such as natural selection) is pushing a lineage towards a single attractor state, the lineage's probability of reaching that equilibrium is greater, the closer its initial state is to that attractor. Similarly, the equilibrium value has a higher probability of being attained, the more time there is in the lineage. When the lineage has a very short duration, stasis is almost certain; as the lineage is given a longer duration, the biased process takes over and the initial condition recedes in its impact on the lineage's final state. In the limit of infinite time, the initial condition is entirely forgotten and the lineage's probability of attaining a given end state is the same, regardless of what the state was in which the lineage began.

How should we conceptualize a pure drift process for continuous phenotypic characters? In this case, u = v, except when the lineage is at the limit values of 0 and 100. If the ancestor has a given trait value, that trait value is the expected value of its descendant. With very little time, the probability distribution of the descendant's value is tightly peaked around the lineage's initial state. As time goes on, this low-variance bell curve flattens and spreads out. With infinite time, there is a flat distribution; each character state has the same probability. Whereas selection in a finite population involves both the shifting and the squashing of a distribution, the process of pure drift involves only squashing.10

Now let's return to the inference problem. If the descendant D is in state a of a quantitative phenotypic trait, which assignment of character state to the ancestor A has the highest likelihood? Since the backwards inequality holds for dichotomous characters, one might expect the model for continuous phenotypic characters to have the same unconditional consequence, namely that A = a has maximum likelihood. This is not always correct (Sober 2002). If D = a, then A = a is the maximum-likelihood assignment when the process is one of pure drift (W. Maddison 1991). And if D = a and a is the optimal character state towards which natural selection is pushing, then A = a is again the maximum-likelihood assignment. However, if the descendant has a character state of, say, 40 and selection is pushing the lineage towards a value of 50, then the maximum-likelihood assignment to the ancestor A will be less than 40; how much less than 40 the maximum-likelihood value is depends on how long the lineage has been evolving, on how strong the directional force is, and on the character's heritability (Sober in press). This means that parsimony and likelihood$_M$ conflict when the model says that there is a directional process whose attractor is some state different from a, the observed character state of the descendant. Thus, to defend the parsimonious assignment of A = a without rejecting the Law of Likelihood, one must reject this model. Parsimony assumes that the trait either evolves by pure drift or by a selection process in which the descendant's character state is optimal.
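The direction of this effect can be seen in a worked sketch. As a stand-in for the bounded model described above, the code below swaps in an Ornstein-Uhlenbeck process, a standard way of modeling a trait pulled towards an optimum; this substitution is an assumption of the sketch, and it ignores the limits 0 and 100. Under that assumed process the descendant's value after time `t` is normally distributed with mean `optimum + (ancestor - optimum) * exp(-alpha * t)` and a variance that does not depend on the ancestor, so the likelihood is maximized by the ancestral value whose expected descendant equals the observed value. The names `alpha`, `sigma2`, and `t` and all numerical values are hypothetical.

```python
import math


def ou_mean(ancestor, optimum, alpha, t):
    """Expected descendant value under the assumed OU process."""
    return optimum + (ancestor - optimum) * math.exp(-alpha * t)


def ou_variance(alpha, sigma2, t):
    """Variance of the descendant value; it does not depend on the ancestor."""
    return sigma2 * (1.0 - math.exp(-2.0 * alpha * t)) / (2.0 * alpha)


def log_likelihood(ancestor, descendant, optimum, alpha, sigma2, t):
    """Log of the normal density of the observed descendant given the ancestor."""
    var = ou_variance(alpha, sigma2, t)
    resid = descendant - ou_mean(ancestor, optimum, alpha, t)
    return -0.5 * math.log(2.0 * math.pi * var) - resid ** 2 / (2.0 * var)


def ml_ancestor(descendant, optimum, alpha, t):
    """Closed form: the ancestor whose expected descendant equals the observation."""
    return optimum + (descendant - optimum) * math.exp(alpha * t)


if __name__ == "__main__":
    closed = ml_ancestor(descendant=40.0, optimum=50.0, alpha=0.1, t=5.0)
    # Cross-check the closed form against a grid of candidate ancestors.
    grid = [x / 2.0 for x in range(0, 201)]  # 0, 0.5, ..., 100
    best = max(grid, key=lambda a: log_likelihood(a, 40.0, 50.0, 0.1, 1.0, 5.0))
    print(closed, best)
```

With these illustrative values the estimate falls below 40, and it moves further from 40 as `alpha * t` grows. When the observed value equals the optimum, the estimate equals the observed value, which is the case in which parsimony and likelihood agree.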

## 3.4.2 Example 2

Consider two extant species A and B and their most recent common ancestor C. Suppose that A = 1 and B = 0; parsimony says that C = 1 and C = 0 are equally parsimonious. In what circumstances do these two assignments of character state to the ancestor have the same likelihood? That is, when will it be true that $\Pr_A(0 \to 1)\,\Pr_B(0 \to 0) = \Pr_A(1 \to 1)\,\Pr_B(1 \to 0)$? Here the subscripts A and B represent which of the two lineages the transition probabilities describe. It is helpful to rewrite this equality as

$$\frac{\Pr_A(1 \to 1)}{\Pr_A(0 \to 1)} = \frac{\Pr_B(0 \to 0)}{\Pr_B(1 \to 0)}$$

If the two lineages experience the same evolutionary processes (i.e. are characterized by the same pair of u, v values), then this equality holds if and only if the duration N of the lineages is 0, or infinity, or u = v. That is, parsimony assumes that if the two lineages are of finite duration and have experienced the same evolutionary process, then that process is pure drift.
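The condition just stated can be checked numerically. The sketch below assumes the standard closed-form N-step transition probabilities for a two-state Markov chain in which u is the per-step probability of a change from 0 to 1 and v the per-step probability of a change from 1 to 0, with 0 < u + v < 1. The function names and parameter values are hypothetical.

```python
import math


def transition(start, end, u, v, n):
    """Pr(lineage ends in `end` after n steps | it began in `start`)."""
    total = u + v
    if total == 0.0:
        return 1.0 if start == end else 0.0
    decay = (1.0 - total) ** n       # how much the initial state still matters
    equilibrium_one = u / total      # long-run probability of state 1
    prob_one = equilibrium_one + (start - equilibrium_one) * decay
    return prob_one if end == 1 else 1.0 - prob_one


def ancestor_likelihoods(u, v, n):
    """Likelihoods of C = 0 and C = 1 given A = 1 and B = 0, equal durations."""
    like_c0 = transition(0, 1, u, v, n) * transition(0, 0, u, v, n)
    like_c1 = transition(1, 1, u, v, n) * transition(1, 0, u, v, n)
    return like_c0, like_c1


if __name__ == "__main__":
    cases = [(0.01, 0.01, 50), (0.02, 0.01, 50),
             (0.02, 0.01, 0), (0.02, 0.01, math.inf)]
    for u, v, n in cases:
        print(u, v, n, ancestor_likelihoods(u, v, n))
```

In this sketch the two likelihoods coincide when u = v, when the duration is 0, and in the limit of infinite duration; with a finite positive duration and u greater than v, C = 0 comes out with the higher likelihood.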

## 3.4.3 Example 3

The next problem is just like the previous one, except that the two lineages have unequal temporal durations. When A = 1 and B = 0, parsimony says that C = 1 and C = 0 are equally well-supported estimates of the ancestral character state even when A is a present-day species and B is a fossil. This temporal difference between A and B means that the lineage leading from C to A lasted longer than the lineage leading from C to B. The two-state Markov model views this difference as evidentially significant; recall that N, the duration of a lineage, figures in the expressions for the lineage transition probabilities. If the processes in the two lineages are characterized by the same values of u and v, then B provides stronger evidence (in the sense of a larger likelihood ratio) about the state of C than A does; likelihood will then favor C = 0 over C = 1. Parsimony denies this. Parsimony therefore assumes that the u and v values that characterize one lineage must differ from the u and v values that characterize the other. But not just any difference between the two pairs of values will suffice for C = 0 and C = 1 to have the same likelihood. The lineage with the longer duration (the one leading to A) must have a smaller value for u + v; in fact, the degree to which its value for u + v must be smaller is determined by the two lineage durations. This has the embarrassing consequence that the lineage leading to A must change its values for u and v in a very precise way as it gets older. At one point the lineage leading to A and the lineage leading to B were of equal duration. But then B went extinct while A continued to exist, so the lineage leading to A got longer while the one leading to B did not. According to parsimony, A's values for u and v must evolve in precise response to its duration and to the duration and u and v values that attach to the lineage leading to B.
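The evidential role of lineage duration can also be sketched numerically. The code assumes the standard closed-form transition probabilities for a two-state Markov chain (u for a change from 0 to 1, v for a change from 1 to 0, with 0 < u + v < 1) and, as a simplification not made in the text, sets u = v within each lineage. Under that simplification the likelihoods are equal exactly when the two lineages retain the same "memory" of the ancestral state, which fixes the value of u + v on the longer lineage. Function names and numbers are hypothetical.

```python
def transition(start, end, u, v, n):
    """Pr(lineage ends in `end` after n steps | it began in `start`)."""
    total = u + v
    if total == 0.0:
        return 1.0 if start == end else 0.0
    decay = (1.0 - total) ** n       # how much the initial state still matters
    equilibrium_one = u / total      # long-run probability of state 1
    prob_one = equilibrium_one + (start - equilibrium_one) * decay
    return prob_one if end == 1 else 1.0 - prob_one


def ancestor_likelihoods(rates_a, n_a, rates_b, n_b):
    """Likelihoods of C = 0 and C = 1 given A = 1 (extant) and B = 0 (fossil)."""
    (u_a, v_a), (u_b, v_b) = rates_a, rates_b
    like_c0 = transition(0, 1, u_a, v_a, n_a) * transition(0, 0, u_b, v_b, n_b)
    like_c1 = transition(1, 1, u_a, v_a, n_a) * transition(1, 0, u_b, v_b, n_b)
    return like_c0, like_c1


def balancing_total_rate(total_b, n_a, n_b):
    """u + v on the lineage to A that restores equal likelihoods,
    assuming u = v within each lineage (an assumption of this sketch)."""
    return 1.0 - (1.0 - total_b) ** (n_b / n_a)


if __name__ == "__main__":
    same = (0.01, 0.01)
    print(ancestor_likelihoods(same, 100, same, 40))   # C = 0 is favored
    total_a = balancing_total_rate(0.02, n_a=100, n_b=40)
    half = total_a / 2.0
    print(total_a, ancestor_likelihoods((half, half), 100, same, 40))
```

With identical symmetric rates, the shorter lineage to the fossil B carries more weight and C = 0 has the higher likelihood. Equality is restored only when the longer lineage is given a smaller value of u + v, and that value depends on both durations.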

## 3.4.4 Example 4

The next example in which a process model M induces a conflict between likelihood$_M$ and parsimony concerns a rooted tree in which the character state of the root is specified. It is a familiar property of parsimony in this context that shared derived characters are said to provide evidence of common ancestry, but that shared ancestral characters do not. If three tip species A, B, and C are in states A = 1, B = 1, and C = 0, with 0 taken to represent the ancestral character state, then (AB)C will be more parsimonious than A(BC). However, if the polarity is reversed, with 1 now representing the ancestral condition, then (AB)C and A(BC) will be equally parsimonious.

Consider the two-state Markov model given before on which an additional constraint is imposed, namely that the probability of one branch's ending in state i if it begins in state j (i,j = 0,1) is the same as another branch's doing the same, if the two branches are simultaneous. This model has the consequence that (AB)C has a higher likelihood than A(BC), given the observation that A = 1, B = 1, C = 0, if all branches have finite duration, regardless of what the polarity of the character is (Sober 1988, pp. 206–212). This contradicts what parsimony asserts, when 1 is the ancestral state. Parsimony's interpretation of the observations therefore requires a rejection of the process model just described.
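A special case of this result can be checked with a short calculation. The sketch below is not the general argument in Sober (1988). It assumes a single pair of values u and v for the whole tree (which satisfies the constraint on simultaneous branches), contemporaneous tips, a stem of `n_stem` steps from the root to the two-species ancestor, and `n_tip` steps from that ancestor to each of its descendants. The closed-form transition probabilities are the standard ones for a two-state chain; all names and numbers are hypothetical.

```python
def transition(start, end, u, v, n):
    """Pr(lineage ends in `end` after n steps | it began in `start`)."""
    total = u + v
    if total == 0.0:
        return 1.0 if start == end else 0.0
    decay = (1.0 - total) ** n       # how much the initial state still matters
    equilibrium_one = u / total      # long-run probability of state 1
    prob_one = equilibrium_one + (start - equilibrium_one) * decay
    return prob_one if end == 1 else 1.0 - prob_one


def likelihood_ab_c(root, u, v, n_stem, n_tip):
    """Pr(A = 1, B = 1, C = 0 | topology (AB)C and the given root state)."""
    # Sum over the unobserved state x of the ancestor of A and B.
    pair = sum(
        transition(root, x, u, v, n_stem) * transition(x, 1, u, v, n_tip) ** 2
        for x in (0, 1)
    )
    return pair * transition(root, 0, u, v, n_stem + n_tip)


def likelihood_a_bc(root, u, v, n_stem, n_tip):
    """Pr(A = 1, B = 1, C = 0 | topology A(BC) and the given root state)."""
    # Sum over the unobserved state y of the ancestor of B and C.
    pair = sum(
        transition(root, y, u, v, n_stem)
        * transition(y, 1, u, v, n_tip)
        * transition(y, 0, u, v, n_tip)
        for y in (0, 1)
    )
    return pair * transition(root, 1, u, v, n_stem + n_tip)


if __name__ == "__main__":
    for root in (0, 1):
        print(root,
              likelihood_ab_c(root, 0.01, 0.02, 10, 10),
              likelihood_a_bc(root, 0.01, 0.02, 10, 10))
```

For these illustrative values (AB)C has the higher likelihood whether the root is in state 0 or state 1, so the shared state 1 counts as evidence of common ancestry under either polarity.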

## 3.4.5 Discussion of the examples

If the Law of Likelihood, as formulated in (2*), is correct, then parsimony assumes the falsehood of any model M for which likelihood$_M$ and parsimony are not ordinally equivalent. A summary of the models discussed in this chapter that parsimony assumes are false is provided in Table 3.1. These descriptions do not lay out the full details of the models that parsimony must reject. This is an important point, since these models are each conjunctions of several propositions. If parsimony assumes that model M is false, this means that at least one of the constitutive propositions that specifies the model must be false, not that all of them must be. So don't take the table's brief description of a model to mean that the detail described must be false. Also, I have described, for each inference problem, a model that parsimony must regard as false; don't assume that this is the only model that parsimony must reject when it addresses that inference problem.

Though each example requires that parsimony reject a process model, a model that parsimony needs to reject in one inference problem doesn't necessarily have to be rejected in another. In Example 1, parsimony requires a nontrivial assumption when the character is quantitative, but no such requirement is imposed when the trait is dichotomous. In Example 3, parsimony assumes that the two lineages experience different evolutionary processes. In Example 2, parsimony does not require this assumption; rather, it assumes that if the same process is at work in the two lineages, then that single process is drift. In Example 4, parsimony leaves open whether selection or drift is operating within a branch, but requires that different simultaneous branches be characterized by different pairs of values for u and v. These results suggest, but do not demonstrate, that parsimony may impose different assumptions about the evolutionary process when it addresses different inference problems.

Although I think these examples make clear at least some of the assumptions that parsimony makes about the evolutionary process, I have not commented on whether those assumptions are innocuous or implausible. I have emphasized that my arguments are predicated on the assumption that the Law of Likelihood, as formulated in (2*), is true. This is so general a proposition that it can hardly be said to be an assumption about evolution. Even so, if it is rejected, we are back to square one. If it is retained, the question becomes more specifically biological, but here again, there are choices to consider. For example, in problem 3, a likelihoodist may wish to maintain that the state of the fossil B provides more evidence about the state of the most recent common ancestor C than the extant organism A does. If so, parsimony's solution to this problem is mistaken. But it is open to the defender of parsimony to take the opposite position; however, I think it is not enough just to insist that parsimony is right in what it says about this example and to conclude from this that the model that leads likelihood to a contrary verdict must be mistaken. An additional argument is needed concerning why the Markov process model should not be taken seriously. This point generalizes to the other examples. All these examples can be handled in the same way by attacking the entire Markov process framework. I don't say that this framework is beyond criticism. Rather, I suggest that criticisms of this framework must be biological in their content. This is an important point: once the Law of Likelihood is accepted, both criticisms and defenses of parsimony must be based on biological, not purely methodological, considerations.

Table 3.1 Summary of the models discussed in this chapter that parsimony assumes to be false

| Example | Inference problem | A model that parsimony assumes is false |
| --- | --- | --- |
| 1 | For a quantitative character evolving in a lineage, infer the character state of the ancestor when the descendant has character state a. | Selection is pushing the lineage towards an optimum that differs from the character state of the descendant. |
| 2 | When two descendants alive now exhibit different states of a dichotomous character, infer the character state of their most recent common ancestor. | The two lineages are characterized by the same pair of values for u and v, where u ≠ v. |
| 3 | When two descendants (one extant, the other a fossil) exhibit different states of a dichotomous character, infer the character state of their most recent common ancestor. | The two lineages are characterized by the same pair of values for u and v. |
| 4 | When two species share a symplesiomorphy not exhibited by a third, infer the rooted tree topology. | Simultaneous branches in the tree have the same pair of values for u and v. |

I have not discussed the issue of statistical consistency in this essay, but there is a part of the debate about that matter that bears on the present discussion. Felsenstein (1978) described a model of evolution and an assumed true phylogeny that together lead parsimony to converge on a false phylogeny as the data set grows without limit. Farris' (1983/1994) reaction was to reject Felsenstein's model as unrealistic; after all, Felsenstein's model says that all traits in a given branch have identical transition probabilities and that the probability of reversion from the derived to the ancestral state is always zero. Felsenstein said at the outset that the model he described is unrealistic; Farris emphatically agreed, and took this point to cancel whatever criticism of parsimony the demonstration of statistical inconsistency might be thought to imply. Farris apparently was reasoning that the correctness of parsimony requires parsimony to be statistically consistent; thus, if model M entails that parsimony is not consistent, then the correctness of parsimony requires that model M be false (note 11). I have reasoned similarly about the examples in this essay, except that I have focused on ordinal equivalence with respect to finite data sets, not on statistical consistency, which describes what happens as the data set grows without limit. This difference aside, I am hardly the first to suggest that parsimony's failing to have some property elucidates what its biological assumptions are.

# 3.5 Acknowledgments

I thank Joe Felsenstein and Michael Steel for useful comments on the present paper. I also want to acknowledge the considerable influence that Steve Farris has had on my thinking about the role of parsimony in phylogenetic inference, and in a wider scientific context. It is a pleasure to contribute this chapter to a volume honoring his work.

## Notes:

(1) For the purposes of this chapter, I will treat ‘the maximum likelihood approach’ as an umbrella term that covers both frequentist and Bayesian implementations. The difference between them is discussed later.

(2) As an expository convenience, I represent H's average likelihood as a discrete summation, rather than as a continuous integration.

(3) Huelsenbeck et al. (2004) average over all of the many different time-reversible models by assigning them equal prior probabilities. Since some of these models are nested inside others, this prior distribution is questionable.

(4) Note that L(H&M) is a proposition, not a number between 0 and 1.

(5) In Sober (2004a) I argue that model selection criteria such as the Akaike information criterion (AIC) permit phylogenetic inference to proceed by considering any number of process models without one's having to commit to any of them.

(6) The two models agree that different traits on the same branch can have different probabilities of changing; this also applies to the same trait on different branches.

(7) Of the two occurrences of one-quarter in this expression, one of them is the prior probability of the root's being in a given state; the other is the probability of a change in state in the tree's interior.

(8) The same result holds when we pose this question about a star phylogeny or a bifurcating tree. If branches are conditionally independent of each other, the support for A = a (as measured by the likelihood ratio) is greater when there are several descendants than when there is just one.

(9) I do not conceptualize the character's maximum and minimum values (0 and 100) as absorbing states. The same will be true in the drift model to be discussed shortly.

(10) See Lande (1976) and Sober (2005) for further details concerning these phenotypic models for selection and drift. Let me emphasize that my discussion of ‘drift’ in this paper is not about random genetic drift, but concerns change in the population average of a quantitative phenotype.

(11) In Sober (1988, pp. 166–171), I argue that a method's statistical consistency is not a necessary condition for one to be rational in using that method. As it happens, there are parameter settings of the Tuffley and Steel (1997) no-common-mechanism model (N) that have the consequence that parsimony and likelihood$_N$ both fail to be statistically consistent. However, I don't see why that forces one to decline to use N as one process model, possibly among several, in phylogenetic inference; for discussion of the use of multiple process models, see Sober (2004a). Furthermore, if using parsimony required the rejection of any model whose parameters can be assigned values that render parsimony statistically inconsistent, then the Tuffley–Steel model would have both of the following properties: (1) it suffices for likelihood and parsimony to agree, and (2) its falsity is presupposed by parsimony. This illustrates how fundamentally different likelihood and statistical consistency are as tools for thinking about parsimony. Bayesians see the Law of Likelihood as fundamental; frequentists such as Felsenstein see consistency as the fundamental desideratum.