## Ziheng Yang

Print publication date: 2006

Print ISBN-13: 9780198567028

Published to Oxford Scholarship Online: April 2010

DOI: 10.1093/acprof:oso/9780198567028.001.0001


# Simulating molecular evolution

Chapter:
9 Simulating molecular evolution
Source:
Computational Molecular Evolution
Publisher:
Oxford University Press
DOI: 10.1093/acprof:oso/9780198567028.003.0009

# Abstract and Keywords

This chapter discusses basic techniques of computer simulation. Topics covered include random number generators, the generation of continuous random variables, the generation of discrete random variables, and the simulation of molecular evolution. Exercises are provided at the end of the chapter.

# 9.1 Introduction

Computer simulation, also known as stochastic simulation or Monte Carlo simulation, is a virtual experiment in which we mimic the biological process on a computer to study its properties. It is particularly useful for studying complex systems or problems which are analytically intractable. Use of random numbers is a major feature of the method. Some authors use the term Monte Carlo simulation when the answer is deterministic, as in calculation of an integral by Monte Carlo integration, and stochastic simulation or computer simulation when the answer involves random variations, as in studies of biases and variances of an estimation method. We ignore such a distinction here.

Simulation is a useful approach for validating a theory or program implementation when the method of analysis is complex. When the model is analytically intractable, simulation provides a powerful way of studying it. Indeed, as it is easy to simulate even under a complex model, there is no need to make overly simplistic assumptions. Simulation is commonly used to compare different analytical methods, especially their robustness when the underlying assumptions are violated, in which case theoretical results are often unavailable. Simulation also plays a useful role in education. By simulating under a model and observing its behaviour, one gains intuition about the system. Lastly, Monte Carlo simulation forms the basis of many modern computation-intensive statistical methods, such as bootstrapping, importance sampling, and Markov-chain Monte Carlo (see Chapter 5).

A consequence of the ease of simulating and the availability of cheap and fast computers is that simulation is widely used and misused. Many simulation studies are poorly designed and analysed, with results largely uninterpretable. One should bear in mind that simulation is experimentation, and the same care should be exercised in the design and analysis of a simulation experiment as in an ordinary experiment. While it is perhaps the simplest thing one can do with a model, simulation does offer the uninitiated ample opportunities for mistakes. A major limitation of simulation is that one can examine only a small portion of the parameter space, and the behaviour of the system may be different in other unexplored regions of the parameter space. One should thus resist the temptation of over-generalization. Analytical results are in general superior as they apply to all parameter values.

This chapter provides a cursory introduction to simulation techniques. The reader is invited to implement the ideas discussed here. Any programming language, such as C/C++, Fortran, BASIC, Java, Perl, or Python, will serve the purpose well. The reader may refer to one of the many textbooks on simulation, such as Ripley (1987) and Ross (1997).

# 9.2 Random number generator

Random variables from the uniform distribution U (0, 1), called random numbers, are fundamentally important in computer simulation. For example, to simulate an event (E) that occurs with probability 0.23, we draw a random number u ∼ U (0, 1). If u < 0.23, we decide that the event occurs; otherwise it does not (Fig. 9.1). It is obvious that the probability that u < 0.23 is 0.23. The U (0, 1) random numbers form the basis for generating random variables from other distributions.

A mathematical algorithm is used to generate a sequence of numbers between 0 and 1 that appear random. The algorithm is deterministic and will produce the same fixed sequence of numbers every time it is run. The generated numbers are thus not random at all, and are more appropriately called pseudo-random numbers but often simply random numbers. The mathematical algorithm is called a random-number generator. A typical simulation study may need millions or billions of random numbers, so it may be important to have a reliable and efficient random-number generator.

A commonly used algorithm is the multiplicative congruential method, given by the following formulae

$A_i = cA_{i-1} \ (\mathrm{mod}\ M)$,  (9.1)

$u_i = A_i / M$.  (9.2)

Here Ai, c, and M are all positive integers: c is called the multiplier and M the modulus, with (a mod M) being the remainder when a is divided by M. Since cAi−1 and Ai have the same remainder when divided by M, they are said to be congruent. A0 is the initial value, known as the seed. A1, A2, … is then a sequence of integers between 0 and M − 1, and u1, u2, … is a sequence of (pseudo-) random numbers.

Fig. 9.1 Sampling from a discrete distribution. (a) To simulate an event that occurs with probability 0.23, draw a random number u ∼ U(0, 1). If it falls into the interval (0, 0.23), the event occurs (E); otherwise the event does not occur (Ē). (b) Similarly, to sample a nucleotide with probabilities 0.1, 0.2, 0.3, and 0.4 for T, C, A, and G, respectively, draw a random number u ∼ U(0, 1) and choose the corresponding nucleotide depending on which of the four intervals of lengths 0.1, 0.2, 0.3, and 0.4 u falls into.

As an example, imagine a computer using the familiar decimal system, with the register (storage) having only four digits. The computer can thus represent any integer in the range (0000, 9999). Let $M = 10^4$, c = 123, and A0 = 13. Equation (9.1) then generates the Ai sequence 1599, 6677, 1271, 6333, 8959, …, corresponding to the ui sequence 0.1599, 0.6677, 0.1271, 0.6333, 0.8959, …. Note that if the product cAi−1 is greater than 9999, we simply ignore the high digits to get the remainder Ai; most computer systems allow such ‘overflow’ without any warning. The sequence ui does not show an obvious trend and appears random. However, as this computer can represent only 10 000 integers and many of them will never occur in the sequence, very soon a certain number Ai will recur and the sequence will then repeat itself. A real computer, however, has more digits, and by intelligent choices of M and c (and, to a lesser extent, A0 as well), the period of the sequence can be made very long.
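The recursion is easy to code. The sketch below (function name ours) reproduces the sequence above with modulus $10^4$, multiplier 123, and seed 13:

```python
# A minimal sketch of the multiplicative congruential generator.
# A_i = c * A_{i-1} mod M; u_i = A_i / M.
def congruential(seed, c, M, n):
    """Return the first n integers A_i and random numbers u_i = A_i / M."""
    A, As = seed, []
    for _ in range(n):
        A = (c * A) % M          # mod discards the high digits ("overflow")
        As.append(A)
    return As, [a / M for a in As]

As, us = congruential(13, 123, 10**4, 5)
print(As)    # [1599, 6677, 1271, 6333, 8959]
```

On the four-digit decimal "computer" of the text, the mod operation is exactly the silent overflow described above.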

Real computers use the binary system, so the natural choice of M is $2^d$, with d often equal to the number of bits in an integer (say, 31, 32, or 64). Note that such a choice of M eliminates the need for the division in equation (9.1) in the modulus calculation, and also makes the division in equation (9.2) a trivial operation.

How do we decide whether the generated numbers are random enough? The requirement is that the numbers produced by the algorithm should be indistinguishable from random draws from the uniform distribution U(0, 1). This can be examined using statistical tests of different departures from expectation. For example, the random numbers should have the correct mean (1/2) and variance (1/12), and should not be autocorrelated. In general, it is not advisable for us biologists to devise random-number generators. Instead one should use algorithms that have been well tested (Knuth 1997; Ripley 1987). Ripley (1987) discussed various tests and commented that random-number generators supplied by computer manufacturers or compiler writers were not trustworthy. It is unclear whether the situation has improved in the last two decades. For the exercises in this chapter, it is sufficient to use a generator provided by the programming language.
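As a quick illustration of such checks (an informal sanity check, not a substitute for the test batteries Ripley describes), the sketch below estimates the mean, variance, and lag-1 autocorrelation of draws from Python's built-in generator:

```python
import random

# U(0,1) draws should have mean near 1/2, variance near 1/12,
# and negligible lag-1 autocorrelation.
random.seed(1)
n = 100_000
u = [random.random() for _ in range(n)]

mean = sum(u) / n
var = sum((x - mean) ** 2 for x in u) / n
lag1 = sum((u[i] - mean) * (u[i + 1] - mean) for i in range(n - 1)) / ((n - 1) * var)

print(mean, var, lag1)   # close to 0.5, 0.0833, 0.0
```

Passing these three checks is necessary but far from sufficient; a poor generator can pass them and still fail higher-dimensional tests.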

# 9.3 Generation of continuous random variables

Random variables used in a computer simulation are all generated using U (0, 1) random numbers. A very useful method is the transformation method, based on the fact that a function of a random variable is itself a random variable (see Appendix A). Thus we generate a random number u ∼ U (0, 1) and apply an appropriate variable transform, x = g(u), to generate a random variable x with the desired distribution.

An important transformation method is the inversion method. Suppose random variable x has a cumulative distribution function (CDF) F(x). Consider u = F(x) as a function of x, so that u is itself a random variable. Its distribution is known to be uniform: u = F(x) ∼ U(0, 1). Thus, if the inverse transform x = F−1(u) is available analytically, one can use it to generate x from a uniform random number u.

1. Uniform distribution. To generate a random variable from the uniform distribution x ∼ U(a, b), one generates u ∼ U(0, 1) and applies the transform x = a + u(b − a).

2. Exponential distribution. The exponential distribution with mean θ has density function $f(x) = \theta^{-1} e^{-x/\theta}$ and CDF $F(x) = \int_0^x \theta^{-1} e^{-y/\theta}\,\mathrm{d}y = 1 - e^{-x/\theta}$. Let u = F(x). This function is easy to invert, giving x = F−1(u) = −θ log(1 − u). Thus we generate u ∼ U(0, 1), and x = −θ log(1 − u) has the exponential distribution with mean θ. Since 1 − u is also U(0, 1), it suffices to use x = −θ log(u).

3. Normal distribution. Very often the CDF is not easy to invert. This is the case with the CDF of the normal distribution, which can only be calculated numerically. However, Box and Muller (1958) described a clever method for generating standard normal variables using a variable transform. Let u1 and u2 be two independent random numbers from U(0, 1). Then

$z_1 = \sqrt{-2\log u_1}\,\cos(2\pi u_2), \quad z_2 = \sqrt{-2\log u_1}\,\sin(2\pi u_2)$  (9.3)

are two independent standard normal variables. (The reader may wish to confirm this using Theorem 1 in Appendix A.) A drawback of the algorithm is that it involves calculation of the expensive functions log, sin, and cos. Also, two random variables are generated even if only one is needed. From a standard normal variable, one can easily sample from the normal distribution N(µ, σ2): if z ∼ N(0, 1), then x = µ + zσ is the desired variable.
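The two transforms can be sketched as follows (function names ours; the mean θ = 2.0 is an arbitrary illustration value):

```python
import math, random

def exponential(theta, rng):
    """Inversion method: x = -theta * log(u) has the exponential distribution."""
    return -theta * math.log(rng.random())

def box_muller(rng):
    """Box-Muller transform: two independent standard normal variables."""
    u1, u2 = rng.random(), rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

rng = random.Random(1)
xs = [exponential(2.0, rng) for _ in range(100_000)]          # mean ~ 2.0
zs = [z for _ in range(50_000) for z in box_muller(rng)]      # mean ~ 0, var ~ 1
```

The sample mean of `xs` should be close to θ, and `zs` should have mean near 0 and variance near 1, which is an easy empirical check on both transforms.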

Random variables from other distributions such as the gamma and beta may be generated from special algorithms. These will not be discussed here; see Ripley (1987) and Ross (1997).

# 9.4 Generation of discrete random variables

## 9.4.1 Discrete uniform distribution

A random variable that takes n possible values (corresponding to n possible experimental outcomes, say), each with equal probability, is said to have a discrete uniform distribution. Suppose the values are 1, 2, …, n. To sample from this distribution, generate u ∼ U(0, 1) and then set x = 1 + [nu], where [a] means the integer part of a. Note that [nu] takes the values 0, 1, …, n − 1, each with probability 1/n. As an example, the JC69 and K80 models predict equal proportions of the four nucleotides. Suppose we let 1, 2, 3, and 4 represent T, C, A, and G, respectively. To sample a nucleotide, we generate u ∼ U(0, 1) and then set x = 1 + [4u]. Another use of the discrete uniform distribution is in the nonparametric bootstrap, in which one samples sites in the sequence at random.

## 9.4.2 Binomial distribution

Suppose the probability of ‘success’ in a trial is p. Then the number of successes in n independent trials has a binomial distribution: x ∼ Bi(n, p). To sample from the binomial distribution, one may simulate n trials and count the number of successes. For each trial, generate u ∼ U(0, 1); if u < p, we count a ‘success’ and otherwise a ‘failure’.

An alternative approach is to calculate the probability of x successes,

$p_x = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.$  (9.4)

One can then sample from the discrete distribution with n + 1 categories: (p0, p1, …, pn). This involves the overhead of calculating the px but may be more efficient if many samples are to be taken from the same binomial distribution; see below about sampling from a general discrete distribution.
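Both approaches can be sketched as follows (helper names ours; n = 10 and p = 0.3 are arbitrary illustration values):

```python
import math, random

def binomial_by_trials(n, p, rng):
    """Simulate n independent trials and count the successes."""
    return sum(rng.random() < p for _ in range(n))

def binomial_by_inversion(n, p, rng):
    """Precompute p_x for x = 0..n, then sample by inverting the CDF."""
    pmf = [math.comb(n, x) * p**x * (1.0 - p)**(n - x) for x in range(n + 1)]
    u, cdf = rng.random(), 0.0
    for x, px in enumerate(pmf):
        cdf += px
        if u < cdf:
            return x
    return n            # guard against floating-point round-off

rng = random.Random(0)
draws = [binomial_by_trials(10, 0.3, rng) for _ in range(20_000)]
draws2 = [binomial_by_inversion(10, 0.3, rng) for _ in range(20_000)]
```

Both samplers should give draws with mean close to np = 3; the second pays the set-up cost of the pmf once per call here, but in a real program the pmf (or an alias table built from it) would be computed once and reused.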

## 9.4.3 General discrete distribution

Suppose we want to sample a nucleotide at random with probabilities 0.1, 0.2, 0.3, 0.4 for T, C, A, G, respectively. We can break the line segment (0, 1) into four intervals: (0, 0.1), [0.1, 0.3), [0.3, 0.6), and [0.6, 1), corresponding to the probabilities (Fig. 9.1b). We then draw a random number u ∼ U(0, 1), and choose the nucleotide depending on which interval u falls into. Specifically, we compare u with 0.1, and choose T if u < 0.1. Otherwise we compare u with 0.3 and choose C if u < 0.3. Otherwise we compare u with 0.6 and choose A if u < 0.6. Otherwise we choose G. In general the comparison is repeated until u is smaller than the upper bound of the interval. It is obvious that the four nucleotides are sampled with the correct probabilities. Note that 0.1, 0.3, 0.6, and 1.0 are the cumulative probabilities, and that this algorithm is the discrete version of the inversion method:

| Category | 1 (T) | 2 (C) | 3 (A) | 4 (G) |
|---|---|---|---|---|
| Probability | 0.1 | 0.2 | 0.3 | 0.4 |
| Cumulative probability (CDF) | 0.1 | 0.3 | 0.6 | 1.0 |

We can sample from any discrete distribution in this way. However, if there are many categories and many samples are needed, this algorithm is very inefficient. One may place the high-probability categories before the low-probability categories to reduce the number of comparisons. In the above example, it takes 1, 2, 3, or 3 comparisons to sample T, C, A, or G, respectively, or 0.1 × 1 + 0.2 × 2 + 0.3 × 3 + 0.4 × 3 = 2.6 comparisons on average to sample a nucleotide. If we order the nucleotides as G, A, C, T, it takes on average 1.9 comparisons. If there are many categories, one may use a cruder classification to find the rough location of the sampled category before doing the finer comparisons.
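The discrete inversion just described can be sketched as follows (function name ours):

```python
import random
from collections import Counter

# Sample T, C, A, G with probabilities 0.1, 0.2, 0.3, 0.4 by comparing a
# random number against the cumulative probabilities 0.1, 0.3, 0.6, 1.0.
def sample_nucleotide(rng, nucs="TCAG", cdf=(0.1, 0.3, 0.6, 1.0)):
    u = rng.random()
    for nuc, bound in zip(nucs, cdf):
        if u < bound:
            return nuc
    return nucs[-1]     # guard against floating-point round-off

rng = random.Random(42)
freq = Counter(sample_nucleotide(rng) for _ in range(100_000))
```

Over many draws the empirical frequencies should approach 0.1, 0.2, 0.3, 0.4; reordering `nucs` and `cdf` as G, A, C, T reduces the expected number of comparisons as noted above without changing the result.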

In general this inversion method is inefficient. A clever algorithm for sampling from a general discrete distribution with a finite number of categories is the alias method. It requires only one comparison to generate a random variable, no matter what the number of categories n is. This is explained below in Subsection 9.4.6.

## 9.4.4 Multinomial distribution

In a binomial distribution, every trial has two outcomes: ‘success’ or ‘failure’. If every trial has k possible outcomes, the distribution is called the multinomial distribution, represented as MN (n, p1, p2, …, pk), where n is the number of trials. The multinomial variables are the counts of the k different outcomes n1, n2, …, nk, with n1 + n2 + … + nk = n. One can generate multinomial variables by sampling n times from the discrete distribution (p1, p2, …, pk), and counting the number of times that each of the k outcomes is observed. If both n and k are large, it may be necessary to use an efficient algorithm to sample from the discrete distribution, such as the alias method described below.

Under most substitution models discussed in this book, the sequence data follow a multinomial distribution. The sequence length is the number of trials n, and the possible site patterns correspond to the categories. For nucleotide data, there are k = $4^s$ categories for s species or sequences. One can thus generate a sequence data set by sampling from the multinomial distribution. See Subsection 9.5.1 below for details.

## 9.4.5 The composition method for mixture distributions

Suppose a random variable has a mixture distribution

$f(x) = \sum_{i=1}^{r} p_i f_i(x),$  (9.5)

where the fi represent discrete or continuous distributions and p1, p2, …, pr are the mixing proportions, which sum to 1. Then f is said to be a mixture or compound distribution. To generate a random variable with distribution f, first sample an integer (let it be I) from the discrete distribution (p1, p2, …, pr), and then sample from fI. This method is known as the composition method.

For example, the so-called ‘Γ+I’ model of rates for sites discussed in Subsection 4.3.1 assumes that a proportion p0 of sites in the sequence are invariable with rate zero, while all other sites have rates drawn from the gamma distribution. To generate the rate for a site, first sample from the distribution (p0, 1 − p0) to decide whether the rate is zero or is drawn from the gamma distribution: generate a random number u ∼ U(0, 1); if u < p0, let the rate be 0; otherwise sample the rate from the gamma distribution. Mayrose et al. (2005) discussed a mixture of several gamma distributions with different parameters; rates can be sampled in a similar way under that model. Also the codon substitution model M8 (beta&ω) discussed in Section 8.4 is a mixture distribution, which assumes that a proportion p0 of sites have the ω ratio drawn from the beta distribution while all other sites (with proportion 1 − p0) have the constant ωs > 1.
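The Γ+I sampling can be sketched as follows (names ours; p0 = 0.2 and shape α = 0.5 are arbitrary illustration values, and the gamma is scaled to mean 1, as usual for relative rates):

```python
import random

def sample_site_rate(p0, alpha, rng):
    """Composition method for the Gamma+I model of rates for sites."""
    if rng.random() < p0:
        return 0.0                              # invariable site
    # gammavariate(alpha, beta) has mean alpha*beta; beta = 1/alpha gives mean 1
    return rng.gammavariate(alpha, 1.0 / alpha)

rng = random.Random(7)
rates = [sample_site_rate(0.2, 0.5, rng) for _ in range(50_000)]
```

The fraction of zero rates should be close to p0, and the overall mean rate close to 1 − p0 (here 0.8), since the gamma component has mean 1.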

## *9.4.6 The alias method for sampling from a discrete distribution

The alias method (Walker 1974; Kronmal and Peterson 1979) is a clever method for simulating random variables from an arbitrary discrete distribution with a finite number of cells (categories). It is an example of the composition method discussed above. It is based on the fact that any discrete distribution with n cells can be expressed as an equiprobable mixture of n two-point distributions. In other words, we can always find n distributions q(m), m = 1, 2, …, n, such that $q_i^{(m)}$ is nonzero for at most two values of i, and that

$p_i = \frac{1}{n} \sum_{m=1}^{n} q_i^{(m)}.$  (9.6)

The statement is proved by construction below, where we describe how to build those two-point distributions q(m). Table 9.1 shows an example with n = 10 cells. The target distribution pi is expressed as a mixture, each component with weight 1/10, of 10 distributions: q(m), m = 1, 2, …, 10. The component distribution q(i) has nonzero probabilities on two cells: cell i (with probability Fi) and another cell Li (with probability 1 − Fi). Fi is called the cutoff and Li the alias. It is a feature of the distributions q(i) considered here that one of the two points is cell i, so the distribution q(i) is fully specified by Fi and Li. For example, F1 = 0.9 and L1 = 3 specify that the distribution q(1) has probability 0.9 on cell 1 and probability 1 − 0.9 = 0.1 on cell 3.

We will describe how to set up the component distributions q(m) or the F and L vectors in a moment. Now suppose they are already given. It is then easy to sample from pi: we generate a random integer k from (1, 2, …, n) and sample from q(k). The algorithm is shown in Box 9.1 (comments are in parentheses).

Step 1 uses a little trick to avoid the need for two random numbers: nu + 1 is a random variable from U(1, n + 1), so its integer part can be used for k and its decimal part for r, with k and r being independent. In our example with n = 10, suppose we get u = 0.421563. Then k = 4 + 1 = 5 and r = 0.21563. Step 2 then uses r to sample from the component distribution q(k); since r < F5, we get i = L5 = 6. The alias algorithm thus uses one random number and does one comparison to generate a random variable from the discrete distribution pi.

Table 9.1 Alias method for simulating a discrete distribution pi with n = 10 cells

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| pi | 0.17 | 0.02 | 0.15 | 0.01 | 0.04 | 0.25 | 0.05 | 0.03 | 0.20 | 0.08 | 1 |
| npi | 1.7 | 0.2 | 1.5 | 0.1 | 0.4 | 2.5 | 0.5 | 0.3 | 2.0 | 0.8 | 10 |
| q(2) | 0.8 | 0.2 | | | | | | | | | 1 |
| q(1) | 0.9 | | 0.1 | | | | | | | | 1 |
| q(4) | | | 0.9 | 0.1 | | | | | | | 1 |
| q(3) | | | 0.5 | | | 0.5 | | | | | 1 |
| q(5) | | | | | 0.4 | 0.6 | | | | | 1 |
| q(7) | | | | | | 0.5 | 0.5 | | | | 1 |
| q(6) | | | | | | 0.9 | | | 0.1 | | 1 |
| q(8) | | | | | | | | 0.3 | 0.7 | | 1 |
| q(10) | | | | | | | | | 0.2 | 0.8 | 1 |
| q(9) | | | | | | | | | 1 | | 1 |
| Fi | 0.9 | 0.2 | 0.5 | 0.1 | 0.4 | 0.9 | 0.5 | 0.3 | 1 | 0.8 | |
| Li | 3 | 1 | 6 | 3 | 6 | 9 | 6 | 9 | | 9 | |

The target distribution (p1, p2, …, p10) is expressed as an equiprobable mixture of 10 two-point distributions q(1), q(2), …, q(10). The component two-point distribution q(i), shown on the rows, has nonzero probabilities on two cells: cell i (with probability Fi) and another cell Li (with probability 1 − Fi). Cells with zero probabilities are left blank. For example, q(1) assigns probabilities 0.9 and 0.1 to cells 1 and 3, respectively, so that F1 = 0.9 and L1 = 3, shown on the bottom two rows. All information about the 10 component distributions is contained in the F and L vectors. Fi and Li are called the cutoff probability and the alias. An algorithm for generating Fi and Li is described in the text; it produces the component distributions q(2), q(1), q(4), …, q(9), in the order shown in the table.

Now we describe how to set up the component distributions q(m), m = 1, 2, …, 10, or equivalently the F and L vectors. The solution is not unique, and any will suffice. Note that the sum of the probabilities for cell i across the n component distributions is npi (equation 9.6). For example, for the first cell, i = 1, this sum is 10p1 = 1.7. The total sum across all cells is n. Our task is to divide up this sum of probabilities among n two-point distributions. Let Fi be the probability (or, more precisely, the sum of probabilities) remaining in cell i. At the start, Fi = npi. At the end of the algorithm, Fi will hold the cutoff probability for cell i, as defined earlier.

We identify a cell j for which Fj < 1 and a cell k for which Fk ≥ 1. (Because $\sum_{i=1}^{n} F_i = n$, such j and k exist except when Fi = 1 for all i, in which case the job is done.) We then generate the distribution q(j), using up all the probability (Fj) on cell j, with the remaining probability needed, 1 − Fj, contributed by cell k. In our example, F2 = 0.2 < 1 and F1 = 1.7 > 1, so let j = 2 and k = 1. The distribution q(2) assigns probability 0.2 to cell j = 2 and probability 1 − 0.2 = 0.8 to cell k = 1. Thus F2 = 0.2 and L2 = 1 are finalized, and cell j = 2 is not considered further in the algorithm. The remaining probability on cell k = 1 becomes F1 = 1.7 − 0.8 = 0.9. At the end of this first round, the total remaining probability mass is n − 1, which will be divided up into n − 1 two-point distributions.

In the second round, we identify a cell j for which Fj < 1 and a cell k for which Fk ≥ 1. Since now F1 = 0.9 < 1 and F3 = 1.5 > 1, we let j = 1 and k = 3. Generate q(1) and let it use up all probability on cell j (that is, F1 = 0.9), with the rest of the probability (0.1) contributed by cell k = 3. Thus F1 = 0.9 and L1 = 3 for cell j = 1 are finalized. F3 = 1.5 − 0.1 = 1.4 is reset.

The process is repeated. Each round generates a two-point distribution q(j) and finalizes Fj and Lj for cell j, reducing the total remaining probability by 1. The process ends after a maximum of n steps. Box 9.2 summarizes the algorithm. It uses an indicator Ij to record the status of cell j: Ij = − 1, 1, or 0, if Fj < 1, Fj ≥ 1, or if cell j is not considered any further.

The alias method is efficient for generating many random variables from the same discrete distribution, once the F and L vectors have been set up. It requires only one comparison to generate one variable, irrespective of n. This contrasts with the inversion method, which requires more comparisons for larger n. Both the storage (for F, L, and I) and the computation required to set up the F and L vectors are proportional to n. It may not be worthwhile to use this algorithm to generate only a few random variables from a discrete distribution, but it is very useful for sampling from the multinomial distribution.
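The set-up and sampling steps can be sketched as follows (names ours; the example distribution is that of Table 9.1, though the construction order, and hence the F and L vectors, may differ from the table since the decomposition is not unique):

```python
import random
from collections import Counter

def build_alias(p):
    """Return the cutoff vector F and alias vector L for distribution p."""
    n = len(p)
    F = [n * pi for pi in p]                     # start with F_i = n * p_i
    L = [None] * n
    small = [i for i in range(n) if F[i] < 1.0]
    large = [i for i in range(n) if F[i] >= 1.0]
    while small and large:
        j, k = small.pop(), large.pop()
        L[j] = k                                 # q(j): F_j on j, 1 - F_j on k
        F[k] -= 1.0 - F[j]                       # probability left on cell k
        (small if F[k] < 1.0 else large).append(k)
    return F, L                                  # leftover cells keep F ~ 1

def alias_sample(F, L, rng):
    """One random number, one comparison."""
    x = rng.random() * len(F)
    k, r = int(x), x % 1.0           # independent: uniform cell and U(0,1)
    return k if L[k] is None or r < F[k] else L[k]

p = [0.17, 0.02, 0.15, 0.01, 0.04, 0.25, 0.05, 0.03, 0.20, 0.08]
F, L = build_alias(p)
rng = random.Random(3)
freq = Counter(alias_sample(F, L, rng) for _ in range(100_000))
```

A useful correctness check is equation (9.6): for every cell i, Fi plus the alias contributions 1 − Fj from all j with Lj = i must equal npi.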

# 9.5 Simulating molecular evolution

## 9.5.1 Simulating sequences on a fixed tree

Here we consider generation of a nucleotide sequence alignment when the tree topology and branch lengths are given. The basic model assumes the same substitution process at all sites and along all branches, but we will also consider more complex models in which the evolutionary process may vary across sites or branches. We consider nucleotide models; amino acid or codon sequences can be generated using the same principles. Several approaches can be used and produce equivalent results.

### 9.5.1.1 Method 1. Sampling sites from the multinomial distribution of site patterns

If the substitution model assumes independent evolution at different sites in the sequence and all sites evolve according to the same model, data at different sites will have independent and identical distributions; they are said to be i.i.d. The sequence data set will then follow a multinomial distribution, with every site a sample point and every site pattern a category (cell) of the multinomial. For a tree of s species, there are $4^s$, $20^s$, or $64^s$ possible site patterns for nucleotide, amino acid, or codon sequences, respectively. Calculation of the probability of every site pattern is explained in Section 4.2, which describes the calculation of the likelihood function under the model. A sequence alignment can thus be generated by sampling from this multinomial distribution. The result will be the counts of sites having each site pattern, many of which may be zero if the sequence is short. If a pattern, say TTTC (for four species), is observed 50 times, one simply writes out 50 sites with the same data TTTC, either with or without randomizing the sites. Most phylogeny programs, especially those for likelihood and Bayesian calculations, collapse sites into patterns to save computation, since the probabilities of observing sites with the same data are identical. As this simulation method generates counts of site patterns, those counts should in theory be directly usable by phylogeny programs.
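For the smallest case, s = 2 sequences under JC69, the 16 site-pattern probabilities are available in closed form, and the method can be sketched as follows (names and the distance d = 0.1 are ours): over a distance d, each of the 4 identical patterns has probability (1/4)(1/4 + (3/4)e^(−4d/3)) and each of the 12 different patterns (1/4)(1/4 − (1/4)e^(−4d/3)).

```python
import math, random
from collections import Counter

def jc69_pair_pattern_probs(d):
    """Probabilities of the 16 site patterns for two sequences under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    same = 0.25 * (0.25 + 0.75 * e)
    diff = 0.25 * (0.25 - 0.25 * e)
    return {x + y: (same if x == y else diff) for x in "TCAG" for y in "TCAG"}

def sample_patterns(d, nsites, rng):
    """Multinomial sampling of site-pattern counts by discrete inversion."""
    probs = jc69_pair_pattern_probs(d)
    keys, cdf, total = list(probs), [], 0.0
    for k in keys:
        total += probs[k]
        cdf.append(total)
    counts = Counter()
    for _ in range(nsites):
        u = rng.random()
        for k, bound in zip(keys, cdf):
            if u < bound:
                counts[k] += 1
                break
        else:
            counts[keys[-1]] += 1    # guard against round-off
    return counts

counts = sample_patterns(0.1, 50_000, rng=random.Random(5))
```

The expected proportion of differing sites is (3/4)(1 − e^(−4d/3)) ≈ 0.0936 at d = 0.1, which the sampled counts should approach. For larger trees one would obtain the pattern probabilities from the likelihood calculation of Section 4.2 instead.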

The multinomial sampling approach is not feasible for large trees because the number of categories becomes too large. However, it is very efficient for small trees with only four or five species, especially when combined with an efficient algorithm for sampling from the multinomial, such as the alias method.

### 9.5.1.2 Method 2. Evolving sequences along the tree

This approach ‘evolves’ sequences along the given tree, and is the algorithm used in programs such as Seq-Gen (Rambaut and Grassly 1997) and EVOLVER (Yang 1997a). First, we generate a sequence for the root of the tree, by sampling nucleotides according to their equilibrium distribution under the model: πT, πC, πA, πG. Every nucleotide is sampled independently from the discrete distribution (πT, πC, πA, πG). If the base frequencies are all equal, one can use the algorithm for discrete uniform distributions, which is more efficient. The sequence for the root is then allowed to evolve to produce sequences at the daughter nodes of the root. The procedure is repeated for every branch on the tree, generating the sequence at a node only after the sequence at its mother node has been generated. Sequences at the tips of the tree constitute the data, while sequences for ancestral nodes are discarded.

To simulate the evolution of a sequence along a branch of length t, calculate the transition probability matrix P(t) = {pij(t)} (see Sections 1.2 and 1.5), and then simulate nucleotide substitutions at every site independently. For example, if a site is occupied by C in the source sequence, the nucleotide in the target sequence will be a random draw from the discrete distribution of T, C, A, G, with probabilities pCT(t), pCC(t), pCA(t), pCG(t), respectively. This process is repeated to generate all sites in the target sequence. As the transition probabilities apply to all sites in the sequence for the branch, one has to calculate them only once for all sites for that branch.
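A sketch of this method for a single branch under JC69 (names ours), for which p_ij(t) = 1/4 + (3/4)e^(−4t/3) if i = j and 1/4 − (1/4)e^(−4t/3) otherwise:

```python
import math, random

NUCS = "TCAG"

def jc69_P(t):
    """JC69 transition probability matrix over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return [[0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e
             for j in range(4)] for i in range(4)]

def evolve(seq, t, rng):
    """Evolve a sequence along one branch, site by site."""
    P = jc69_P(t)                    # computed once, reused for every site
    out = []
    for nuc in seq:
        row = P[NUCS.index(nuc)]
        u, cdf, new = rng.random(), 0.0, NUCS[-1]
        for j, pj in enumerate(row):
            cdf += pj
            if u < cdf:
                new = NUCS[j]
                break
        out.append(new)
    return "".join(out)

rng = random.Random(11)
root = "".join(rng.choice(NUCS) for _ in range(50_000))   # uniform root sequence
tip = evolve(root, 0.1, rng)
```

The fraction of sites that differ between `root` and `tip` should be close to 1 − p_ii(t) = (3/4)(1 − e^(−4t/3)) ≈ 0.0936 for t = 0.1.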

### 9.5.1.3 Method 3. Simulating the waiting times of a Markov chain

This is a variation of method 2. One generates a sequence for the root, and then simulates the evolution of any site along any branch as follows. Suppose the branch length is t, and the rate matrix of the Markov chain is Q = {qij}. Let $q_i = -q_{ii} = \sum_{j \neq i} q_{ij}$ be the substitution rate of nucleotide i. Suppose the site is currently occupied by nucleotide i. Then the waiting time until the next substitution event has an exponential distribution with mean 1/qi. We draw a random waiting time s from this exponential distribution. If s > t, no substitution occurs before the end of the branch, so the target sequence has nucleotide i at the site as well. Otherwise a substitution occurs and we decide which nucleotide the site changes into: given that the site with nucleotide i has experienced a change, the probability that it changes into nucleotide j is qij/qi, and we sample j from this discrete distribution. The remaining time for the branch becomes t − s. We then draw a random waiting time from the exponential distribution with mean 1/qj. The process is repeated until the time for the branch is exhausted.
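A sketch of this procedure for a single site (names ours), under JC69 scaled so that the total rate per site is 1 (qij = 1/3 for j ≠ i, hence qi = 1):

```python
import random

NUCS = "TCAG"
Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

def evolve_site(i, t, rng):
    """Exponential waiting times plus the jump chain along a branch of length t."""
    while True:
        qi = -Q[i][i]
        s = rng.expovariate(qi)            # waiting time, mean 1/q_i
        if s > t:
            return i                       # branch ends before the next change
        t -= s                             # a substitution occurs
        u, cdf = rng.random(), 0.0
        for j in range(4):
            if j != i:
                cdf += Q[i][j] / qi        # jump chain: P(i -> j) = q_ij / q_i
                if u < cdf:
                    i = j
                    break

rng = random.Random(13)
changed = sum(evolve_site(0, 0.1, rng) != 0 for _ in range(50_000))
```

With this scaling, t is the expected number of substitutions per site, so the fraction of changed sites should match the JC69 expectation (3/4)(1 − e^(−4t/3)) ≈ 0.0936 at t = 0.1, the same value method 2 gives.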

This simulation procedure is based on the following characterization of a continuous-time Markov chain (Fig. 9.2). The waiting time until the next transition (change) is exponential with mean 1/qi. If we ignore the waiting times between transitions, the sequence of states visited by the process constitutes a discrete-time Markov chain, which is called the jump chain. The transition matrix of the jump chain has elements

$m_{ij} = q_{ij}/q_i \ \ (j \neq i), \quad m_{ii} = 0.$  (9.7)

Note that every row sums to 1.

Fig. 9.2 Characterization of the Markov process as exponential waiting times and a jump chain. The waiting times until the next substitution s1, s2, and s3 are independent random variables from the exponential distributions with means 1/qT, 1/qC, and 1/qA, respectively, where qT, qC, and qA are the substitution rates of nucleotides T, C, and A, respectively.

The algorithm of simulating exponential waiting times and the jump chain may be applied to the whole sequence instead of one site. The total rate of substitution is the sum of the rates across sites, and the waiting time until a substitution occurs at any site in the whole sequence has an exponential distribution with the mean equal to the reciprocal of the total rate. If a substitution occurs, it is assigned to sites with probabilities proportional to the rates at the sites.

An advantage of this simulation procedure is that it does not require calculation of the transition-probability matrix P(t) over branch length t, as both the waiting times and the transition matrix for the jump chain are fully specified by the instantaneous rates. As a result, this procedure can be adapted to simulate more complex sequence changes such as insertions and deletions. One simply calculates the total rate of all events (including substitutions, insertions, and deletions) for the whole sequence, and simulates the exponential waiting time until the next event. If an event occurs before the end of the branch, one assigns the event to a site and to an event type (a substitution, insertion, or deletion) with probabilities in proportion to the rates of those events at the sites.

### 9.5.1.4 Simulation under more complex models

The methods discussed above can be modified to simulate under more complex models, for example to allow different substitution parameters among branches (such as different transition/transversion rate ratios κ, different base frequencies, or different ω ratios). The multinomial sampling approach (method 1) applies as long as the site-pattern probabilities are calculated correctly under the model. The approaches of evolving sequences along branches of the tree, either by calculating the transition probabilities (method 2) or by simulating the waiting times and the jump chain (method 3), are also straightforward; one simply uses the appropriate model and parameters for each branch when simulating the evolutionary process along that branch.

One may also simulate under models that allow heterogeneity among sites. We will consider variable substitution rates as an example, but the approaches apply to other kinds of among-site heterogeneity. There are two kinds of models that incorporate rate variation among sites (see Subsections 4.3.1 and 4.3.2). The first is the so-called fixed-sites model, under which every site in the sequence belongs to a predetermined site partition. For example, one may simulate five genes evolving on the same tree but at different rates r1, r2, …, r5. The transition probability matrix for gene k is pij(trk). Sites in different genes do not have the same distribution, but within each gene the sites are i.i.d. Thus one can use any of the three methods discussed above to simulate the data for each gene separately and then merge them into one data set.

A second kind of heterogeneous-sites model is the so-called random-sites model. Examples include the gamma models of variable rates for sites (Yang 1993, 1994a), the codon models of variable ω ratios among sites (Nielsen and Yang 1998; Yang et al. 2000), and the covarion-like models (Galtier 2001; Huelsenbeck 2002; Guindon et al. 2004). The rates (or some other features of the substitution process) are assumed to be random variables drawn from a common statistical distribution, and we do not know a priori which sites have high rates and which have low rates. Data at different sites are i.i.d. The approach of multinomial sampling can be used directly, although the site-pattern probabilities have to be calculated under the heterogeneous-sites model. One may also sample the rate for each site and apply the method of simulating evolution along branches. If a continuous distribution is used, one should in theory calculate the transition-probability matrix P(t) for every site and every branch. If a few site classes are assumed (as in the discrete-gamma model), one may sample rates for sites first, and then simulate data for the different rate classes separately, perhaps followed by a randomization of the sites. Note that under the random-sites model, the number of sites in any site class varies among simulated replicates, whereas in the fixed-sites model the number is fixed.
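For instance, the first step of the random-sites recipe, sampling a rate class for every site, might look like this (a Python sketch with two illustrative rate classes; the rate values and probabilities are assumptions, not from the text):

```python
import random

rates = [0.2, 1.8]   # illustrative class rates (mean 1), e.g. a two-class discretization
probs = [0.5, 0.5]   # class probabilities

def sample_site_rates(n_sites, rng):
    """Draw an i.i.d. rate for every site; data are then simulated class by class."""
    return [rng.choices(rates, probs)[0] for _ in range(n_sites)]

rng = random.Random(42)
site_rates = sample_site_rates(500, rng)
# Each class is then simulated with transition probabilities p_ij(t*r), followed
# by a shuffle of the sites.  Under this random-sites model the class sizes vary
# among replicates, unlike under the fixed-sites model.
n_slow = site_rates.count(rates[0])
```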

## 9.5.2 Generating random trees

Several models can be used to generate random trees with branch lengths: the standard coalescent model, the Yule branching model, and the birth–death process model either with or without species sampling. All those models assign equal probabilities to all labelled histories. (A labelled history is a rooted tree with the internal nodes ranked according to their ages.) One may generate a random genealogical or phylogenetic tree by starting with the tips of the tree, and joining nodes at random until there is one lineage left.

The ages of the nodes are independent of the labelled history and can be attached to the tree afterwards. Under the coalescent model, the waiting time until the next coalescent event has an exponential distribution (see equation 5.41). Under the birth–death process model, the node ages on the labelled history are order statistics from a kernel density (see equation 7.24) and can be simulated easily (Yang and Rannala 1997). Trees generated in this way have branch lengths conforming to a molecular clock. One may modify the substitution rates to produce trees in which the clock is violated.
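Under the coalescent, for example, the labelled history and the node ages can be generated together in a few lines (a Python sketch, with time in units of 2N generations; the nested-tuple tree representation is just one convenient choice):

```python
import random

def coalescent_tree(n, rng=None):
    """Random coalescent genealogy for n samples: join random pairs of lineages,
    with exponential waiting time of rate j(j-1)/2 while j lineages remain."""
    rng = rng or random.Random()
    nodes = [(i, 0.0) for i in range(n)]           # (subtree, age); tips have age 0
    age = 0.0
    while len(nodes) > 1:
        j = len(nodes)
        age += rng.expovariate(j * (j - 1) / 2.0)  # time to the next coalescence
        a, b = rng.sample(range(j), 2)             # choose a random pair to join
        joined = ((nodes[a], nodes[b]), age)
        nodes = [nodes[i] for i in range(j) if i not in (a, b)] + [joined]
    return nodes[0]                                # root; its second element is the tree height
```

The expected tree height under this model is 2(1 − 1/n) in units of 2N generations.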

One may also generate random rooted or unrooted trees without assuming any biological model, by sampling at random from all possible trees. Branch lengths may also be sampled from arbitrary distributions such as the exponential or the gamma.

# (p. 306 ) 9.6 Exercises

9.1 Write a small simulation program to study the birthday problem. Suppose that there are 365 days in a year and that one’s birthday falls on any day at random. Calculate the probability that at least two people out of a group of k = 30 people have the same birthday (that is, they were born on the same day and month but not necessarily in the same year). Use the following algorithm. (The answer is 0.706.)

1. Generate k = 30 birthdays, by taking 30 random draws from 1, 2, …, 365.

2. Check whether any two birthdays are the same.

3. Repeat the process $10^6$ times and calculate the proportion of times in which at least two out of 30 people have the same birthday.
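The three steps above might be coded as follows (a Python sketch; $10^5$ replicates are used here to keep the run short, whereas the exercise asks for $10^6$):

```python
import random

def birthday_prob(k=30, n_reps=100_000, seed=0):
    """Estimate the probability that at least two of k people share a birthday."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        days = [rng.randint(1, 365) for _ in range(k)]  # step 1: k random birthdays
        if len(set(days)) < k:                          # step 2: any two the same?
            hits += 1
    return hits / n_reps                                # step 3: proportion of hits

p = birthday_prob()   # should be close to the exact answer 0.706
```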

9.2 Monte Carlo integration (Subsection 5.3.1). Write a small program to calculate the integral $f(x)$ in the Bayesian estimation of sequence distance under the JC69 model, discussed in Subsection 5.1.2. The data are x = 90 differences out of n = 948 sites. Use the exponential prior with mean 0.2 for the sequence distance θ. Generate $N = 10^6$ or $10^8$ random variables from the exponential prior: $\theta_1, \theta_2, \ldots, \theta_N$, and calculate

$$f(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x|\theta_i). \tag{9.8}$$

Note that the likelihood $f(x|\theta_i)$ may be too small to represent in the computer, so scaling may be needed. One way is as follows. Compute the maximum log likelihood $\ell_m = \log\{f(x|\hat\theta)\}$, where $\hat\theta = 0.1015$ is the MLE. Then multiply each $f(x|\theta_i)$ in equation (9.8) by the big number $e^{-\ell_m}$ so that the terms are not all vanishingly small before summing them up; that is,

$$f(x) \approx e^{\ell_m} \times \frac{1}{N} \sum_{i=1}^{N} e^{\log\{f(x|\theta_i)\} - \ell_m}. \tag{9.9}$$
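A possible implementation of this exercise, including the log-scaling trick, is sketched below in Python (the binomial coefficient is included in the likelihood, and $10^5$ draws are used to keep the run short; variable names are illustrative):

```python
import math, random

x, n = 90, 948                       # observed differences and number of sites
lnC = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

def loglik(theta):
    """Log likelihood under JC69: binomial with p = 3/4 (1 - exp(-4 theta / 3))."""
    p = 0.75 * (1.0 - math.exp(-4.0 * theta / 3.0))
    return lnC + x * math.log(p) + (n - x) * math.log(1.0 - p)

theta_hat = -0.75 * math.log(1.0 - 4.0 * (x / n) / 3.0)  # MLE, about 0.1015
lm = loglik(theta_hat)                                   # maximum log likelihood

rng = random.Random(1)
N = 100_000                          # the exercise uses 10**6 or 10**8 draws
total = sum(math.exp(loglik(rng.expovariate(1.0 / 0.2)) - lm) for _ in range(N))
log_fx = lm + math.log(total / N)    # log of the marginal likelihood f(x)
```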

9.3 Write a small simulation program to study the optimal sequence divergence when two sequences are compared to estimate the transition/transversion rate ratio κ under the K80 model. Assume κ = 2 and use a sequence length of 500 sites. Consider several sequence distances, say, d = 0.01, 0.02, …, 2. For each d, simulate 1000 replicate data sets under the K80 model and analyse each of them under the same model to estimate d and κ using equation (1.11). Calculate the mean and variance of the estimate $\hat\kappa$ across replicate data sets. Each data set consists of a pair of sequences, which can be generated using any of the three approaches discussed in Subsection 9.5.1.
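One way to generate and analyse a single replicate is sketched below in Python. The simulation samples each site's type (transition difference, transversion difference, or identical) from the K80 pattern probabilities, and the estimators are the standard K80 distance formulas, here assumed to correspond to equation (1.11), with S and V the observed proportions of transitional and transversional differences:

```python
import math, random

def k80_sim(d, kappa, n_sites, rng):
    """Simulate a pair of K80 sequences; return proportions (S, V) of site types."""
    bt = d / (kappa + 2.0)            # beta*t, with alpha*t = kappa * beta*t
    at = kappa * bt
    P = 0.25 + 0.25 * math.exp(-4 * bt) - 0.5 * math.exp(-2 * (at + bt))
    Q = 0.5 - 0.5 * math.exp(-4 * bt)
    S = V = 0
    for _ in range(n_sites):
        u = rng.random()
        if u < P:
            S += 1                    # site shows a transitional difference
        elif u < P + Q:
            V += 1                    # site shows a transversional difference
    return S / n_sites, V / n_sites

def k80_est(S, V):
    """Distance and kappa estimates from the observed proportions S and V."""
    a1, a2 = 1 - 2 * S - V, 1 - 2 * V
    d = -0.5 * math.log(a1) - 0.25 * math.log(a2)
    kappa = 2 * math.log(a1) / math.log(a2) - 1
    return d, kappa

rng = random.Random(3)
S, V = k80_sim(0.5, 2.0, 500, rng)
d_hat, k_hat = k80_est(S, V)
```

Looping this over the grid of d values and 1000 replicates per value gives the mean and variance of $\hat\kappa$ asked for in the exercise.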

9.4 Long-branch attraction by parsimony. Use the JC69 model to simulate data sets on a tree of four species (Fig. 9.3a), with two different branch lengths a = 0.1 and b = 0.5. Simulate 1000 replicate data sets. For each data set, count the sites with the three site patterns xxyy, xyxy, xyyx, and determine the most parsimonious tree. To simulate a data set, reroot the tree at an interior node, as in, say, Fig. 9.3(b). Generate (p. 307 ) a sequence for the root (node 0) by random sampling of the four nucleotides, and then evolve the sequence along the five branches of the tree. You may also use the approach of multinomial sampling. Consider a few sequence lengths, such as 100, 1000, and 10 000 sites.

Fig. 9.3 (a) A tree of four species, with three short branches (of length a) and two long branches (with length b) for simulating data to demonstrate long-branch attraction. (b) The same tree rerooted at an ancestral node for simulation.
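A sketch of Exercise 9.4 in Python, evolving one site at a time along the rerooted tree, under one possible reading of Fig. 9.3(b) in which the long branches (length b) lead to species 1 and 3; only the pattern counting is shown, not the parsimony decision:

```python
import math, random

NUC = "TCAG"

def jc_child(parent, t, rng):
    """JC69: keep the parent state with prob 1/4 + 3/4 exp(-4t/3), else change."""
    if rng.random() < 0.25 + 0.75 * math.exp(-4.0 * t / 3.0):
        return parent
    return rng.choice([b for b in NUC if b != parent])

def count_patterns(n_sites, a, b, rng):
    """Simulate n_sites sites on the rerooted tree; count informative patterns."""
    counts = {"xxyy": 0, "xyxy": 0, "xyyx": 0}
    for _ in range(n_sites):
        n0 = rng.choice(NUC)          # root sequence sampled at random
        s1 = jc_child(n0, b, rng)     # long branch to species 1 (assumed)
        s2 = jc_child(n0, a, rng)
        n5 = jc_child(n0, a, rng)     # internal branch
        s3 = jc_child(n5, b, rng)     # long branch to species 3 (assumed)
        s4 = jc_child(n5, a, rng)
        if s1 == s2 and s3 == s4 and s1 != s3:
            counts["xxyy"] += 1       # supports the true tree ((1,2),(3,4))
        elif s1 == s3 and s2 == s4 and s1 != s2:
            counts["xyxy"] += 1       # groups the two long branches together
        elif s1 == s4 and s2 == s3 and s1 != s2:
            counts["xyyx"] += 1
    return counts

c = count_patterns(10_000, 0.1, 0.5, random.Random(5))
```

The most parsimonious tree for a replicate is then determined by which of the three counts is largest.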

9.5 A useful test of a new and complex likelihood program is to generate a few data sets of very long sequences under the model and then analyse them under the same model, to check whether the MLEs are close to the true values used in the simulation. As MLEs are consistent, they should approach the true values when the sample size (sequence length) becomes larger and larger. Use the program written for Exercise 9.4 to generate a few data sets of $10^6$, $10^7$, or $10^8$ sites and analyse them using a likelihood program (such as PHYLIP, PAUP, or PAML) under the same JC69 model, to see whether the MLEs of branch lengths are close to the true values. Beware that some programs may demand a lot of resources to process large data sets; save your important work before this exercise in case of a computer crash.