## Otso Ovaskainen, Henrik Johan de Knegt, and Maria del Mar Delgado

Print publication date: 2016

Print ISBN-13: 9780198714866

Published to Oxford Scholarship Online: August 2016

DOI: 10.1093/acprof:oso/9780198714866.001.0001

Show Summary Details
Page of

PRINTED FROM OXFORD SCHOLARSHIP ONLINE (www.oxfordscholarship.com). (c) Copyright Oxford University Press, 2016. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a monograph in OSO for personal use (for details see http://www.oxfordscholarship.com/page/privacy-policy). Subscriber: null; date: 08 December 2016

# (p.215) Appendix A Mathematical methods

Source:
Quantitative Ecology and Evolutionary Biology
Publisher:
Oxford University Press

# A.1 A very brief tutorial to linear algebra

In this section, we cover the very basics of matrix and vector algebra. Matrices and vectors are arrays of numbers, and they are usually denoted by bold letters, capitalized for matrices and non-capitalized for vectors. So let us define the example matrices A, B, and C, and the vectors x and y by

(A.1)
$Display mathematics$

The matrices $A$ and $B$ have 2 columns and 2 rows, so they are a $2×2$ matrices, whereas the matrix $C$ has 2 rows and 3 columns, so it is a $2×3$ matrix. Vectors are just special cases of matrices of which either of the dimensions is 1. Row vectors consist of a single row and thus have dimension $1×n$, whereas column vectors have a single column and thus have dimension $n×1$. For example, the column vector $x$ is a $2×1$ matrix. Numbers can also be considered as special cases of matrices, with dimension $1×1$, as is the case with $y$ in Eq. (A.1).

Two matrices that have the same dimensions (same numbers of rows and columns) are added by simply summing their elements, yielding e.g.

(A.2)
$Display mathematics$

Matrices are multiplied by a number by multiplying every element by that number, so that e.g.

(A.3)
$Display mathematics$

Two matrices can also be multiplied by each other, but matrix multiplication is different from element-wise multiplication. To illustrate, multiplying the matrix A by the matrix C yields D = AC, with

(A.4)
$Display mathematics$

For example, the element (1, 3) of the matrix $D$ has value $−1$. This can be derived by taking the row 1 of matrix $A$ (with values 2 and 1) and the column 3 of matrix $C$ (with values $−1$ and 1), multiplying these values element by element (giving $−2$ and 1), and then summing the values. More generally, assume that (p.216) the dimensions of the matrix $A$ are $n×k$, and that the dimensions of the matrix $C$ are $k×m$. These matrices can be multiplied together because the ‘inner dimensions’ match: the second dimension of the first matrix and the first dimension of the second matrix are both $k$. The dimensions of the product matrix $D=AC$ will be given by the ‘outer dimensions’, i.e. the first dimension of the first matrix and the second dimension of the second matrix. Thus, the dimensions of $D$ are $n×m$. The element $Dij$ is computed from the elements in the row $i$ of matrix $A$ and the column $j$ of matrix $C$. Both of these have $k$ numbers, which are multiplied element by element, and added together. As an equation, $dij=ai1c1j+ai2c2j+⋯+aikckj=∑l=1kailclj$.

Unlike with numbers, with matrices it does not generally hold that $AC=CA$, and therefore it does matter in which order matrices are multiplied by each other. For example, with the example matrices given in Eq. (A.1), we computed the product $AC$, but it is not even possible to compute the product $CA$ because the inner dimensions of the matrices do not match in this order.

One often needed matrix operation is that of the transpose, denoted by the superscript $T$. Transposing a matrix simply means switching the roles of rows and columns, so that e.g. the transpose of the matrix $C$ would be

(A.5)
$Display mathematics$

One common application of matrix algebra is solving a system of linear equations. Thus, we wish to find the solution $x$ to the equation $Ax=b$, where the matrix $A$ and the vector $b$ are given. For example, let $A$ be as given in Eq. (A.1), an let $b=(010)T$. Note that as shown in Eq. (A.5) we have used the transpose $T$ to convert the $1×2$ row vector $010$ into the $2×1$ column vector $(010)T$. The matrix equation $Ax=b$ corresponds to the system of the two linear equations

(A.6)
$Display mathematics$

where we wish to find the numbers $x1$ and $x2$ which solve the system. It is easy to see that the solution here is $x1=−1$ and $x2=2$, or in the matrix form $x=(x1x2)T=(−12)T$. In matrix form, the solution can be written as the product $x=A−1b$, where $A−1$ is called the inverse of the matrix $A$.

Another commonly needed matrix operation is that of finding eigenvalues and eigenvectors. These can be computed only for square matrices, i.e. $n×n$ matrices with an equal number of rows and columns. For such a matrix $A$, a number $λ$ and a $n×1$ column vector $x$ are said to form an eigenvalue–eigenvector pair if $Ax=λx$. For example, for the matrix $A$ given in Eq. (A.1), $λ=2$ and form an eigenvalue–eigenvector pair. To see this, we apply matrix multiplication to see that , which is the same as . A $n×n$ matrix generally has $n$ eigenvalue–eigenvector pairs. The reader may wish to verify that the second such pair for the example matrix $A$ is given by $λ=5$ and .

More precisely, solutions $x$ to the equation $Ax=λx$ are called the right eigenvectors. Vectors $y$ that satisfy $yTA=λyT$ are correspondingly called left eigenvectors. The eigenvalues $λ$ are the same in both cases but the left and right eigenvector are generally different. For example, the left eigenvector of the matrix $A$ associated with the eigenvalue $λ=2$ is , whereas the left eigenvector associated with the eigenvalue $λ=5$ is . The left eigenvectors of the matrix $A$ are the right eigenvectors of the matrix $AT$. We will not go further (p.217) here on how eigenvalues and eigenvectors can be computed—in practice this can be done with matrix algebra routines implemented in many programming languages.

Other commonly needed matrix operations include e.g. computing the determinant of a matrix, and various matrix decompositions such as the $LU$ decomposition and the Cholesky decomposition. These are explained in many basic textbooks of linear algebra, such as Poole (2014).

# A.2 A very brief tutorial to calculus

## A.2.1 Derivatives, integrals, and convolutions

A derivative measures the rate of change. If $f(t)$ is a function of time $t$, the derivative $df(t)/dt$, also denoted by $f′(t)$, tells how fast the function changes (increases or decreases) as time increases. Similarly, if $f(x)$ is a function of one-dimensional space $x$, then $f′(x)$ tells how the function changes (increases or decreases) as one moves across the space from left to right, i.e. in increasing direction of the $x$ coordinate. In Figure A.1, when moving from left to right, the function $f(x)$ (shown in Figure A.1A) first increases, and thus the derivative $f′(x)$ (shown in Figure A.1B) attains positive values in this part of the space. Then the function decreases (and thus the derivative is negative), then increases (and thus the derivative is positive), and finally decreases again (and thus the derivative is negative). In those locations where the function attains a local maximum or minimum, it does not increase nor decrease, and thus the derivative equals 0.

Figure A.1 Illustration of the concepts of derivative and integral. (A) shows an arbitrary function. Its first and second derivatives are shown in (B) and (C), respectively. Conversely, if thinking that (C) shows the function, (B) and (A) show its first and second integrals, respectively.

The second derivative $f′′(x)$ is simply defined as the derivative of the first derivative. Thus, it measures how fast the first derivate increases or decreases, as can be seen by comparing Figures A.1B and C as we just did for Figures A.1A and B. A comparison between Figures A.1A and C illustrates how the second derivative $f′′(x)$ measures the curvature of the function $f(x)$. If thinking of the function $f$ as e.g. the population density, the second derivative is positive in locations where population density is lower than in the neighbouring locations, whereas the second derivative is negative in locations where population density is higher than in the neighbouring locations. For example, in the middle of Figure A.1A (at $x=2$), the function $f(x)$ has a local minimum. At this location the function is convex, which corresponds to a positive second derivative, meaning that population density at $x=2$ is lower than on nearby locations. (p.218) Similarly, the third derivative $f″(x)$, also denoted by $f3(x),$ is defined as the derivative of the second derivative. More generally, the derivative of order $n$ is denoted by $f(n)(x)$.

Integral is the inverse of derivative, so e.g. the integral of $f″(x)$ is $f″(x)$, the integral of which is $f′(x)$, the integral of which is $f(x)$. We may write $f(x)=∫−∞xf′(y)dy+c$. The upper limit of the integration is $x$ and thus the value of a function at $x$ equals its derivative integrated from some lower limit up to that point. We have set here the lower limit to $−∞$, but it could have been set to any value. This is because knowing the derivative does not determine the function completely, only up to an integration constant, which we have denoted above by $c$. Changing the lower limit of the integration can be compensated for by changing the integration constant.

The area under a curve can be computed with an integral. For example, for the function $f$ of Figure A.1A, it holds that $∫04f(x)dx=1$, and thus the region between the x-axis and the curve has an area of 1.

In addition to derivation and integration, some ecological models involve a convolution term. Convolution does not apply to a single function but to a pair of functions. The convolution between the functions $f(x)$ and $g(x)$ is denoted by $(f∗g)(x)$, and it is defined by

(A.7)
$Display mathematics$

Figure A.2 gives examples of how the convolution of two functions behaves. If thinking of the function shown in Figure A.2A as the density (number per unit area) of individuals (say, trees) and the function of the Figure A.2B as the dispersal kernel (say, of seeds), the convolution shown in Figures A.2C describes the distribution of the seeds after dispersal. In this example the dispersal kernel $g$ is symmetric around 0, in which case the convolution produces a smoothing (i.e. a local average) of the function $f$. Note that in this example the density of seeds is not highest where the density of trees is highest. In Figure A.2D we have taken the convolution of the function $f$ with itself. As the function is not centred around 0, the convolution has also moved the function in space. As we will discuss in Appendix A.3, one application of convolutions relates to computing the distribution for the sum of two random variables.

Figure A.2 Illustration of the concept of convolution. (A) and (B) show two arbitrary functions, denoted by $f(x)$ and $g(x)$, respectively. (C) shows the convolution $f∗g(x)$ and (D) the convolution $f∗f(x)$.

## A.2.2 Differential equations

Differential equations are routinely used in many kinds of models in ecology and evolution. As a very simple example, assume that the population consists initially (at time $t=0$) of a single individual, and that the population increases at a constant rate 2, meaning that 2 new individuals arrive per time unit to the population. The mathematical description of this problem can be written as the differential equation $f′(t)=2$ and the initial condition $f0=1$. The solution to this problem is $f(t)=1+2t$, and thus the population density increases linearly with time, with the slope 2. To see that this function indeed solves the problem, we first note that $f0=1+2×0=1$ and thus the proposed solution satisfies the initial condition. Further, by computing the derivative of the function $f(t)=1+2t$ with respect to time $t$ yields $f′(t)=2$, and thus the proposed solution satisfies also the differential equation.

As a side note, this model predicts that e.g. at time $t=0.25$ the population will consist of 1.5 individuals, which is clearly not feasible. This is because differential equations are deterministic models defined in a continuous state space (see Table 1.1), and consequently they are best suited for modelling large populations not influenced by stochastic fluctuations. If having initially 1,000 individuals instead of a single individual, there would be 1,500 individuals at time 0.25. While the model would still predict non-integer values, approximating 1,501 individuals by 1,500.5 individuals is more accurate than approximating 2 individuals by (p.219) 1.5 individuals. Another option is to consider the differential equation as modelling the expected value of a stochastic process: if starting with a single individual at time $t=0.0$, then at $t=0.25$ the population may consist of 1, sometimes of 2 (or 3 or more) individuals, with 1.5 being the expected value. In Appendix A.3 we will discuss Markov processes which can be seen as stochastic counterparts of differential equations.

In the example just described, we assumed that the arrival rate of new individuals is independent of the present size of the population. Let us then assume that new individuals appear because the individuals that make up the present population produce offspring. If the per capita rate (number per unit time) by which the present individuals produce offspring is 2, then the growth rate of the population is $2f(t)$, where $f(t)$ is the current size of the population. This reasoning leads to the differential equation $f′(t)=2f(t)$. The solution to this problem is $f(t)=e2t$. To see that this is indeed the solution, one needs to check that $f(t)=e2t$ satisfies the initial condition as well as the differential equation. Clearly, $f0=1$, and thus the initial condition holds. Further, $f′(t)=2e2t$, and thus also the differential equation $f′(t)=2f(t)$ holds. Note that now the solution $f(t)=e2t$ does not correspond to linear growth but to exponential growth.

Differential equations can also involve higher derivatives. For example, consider the second-order differential equation $f″(t)=2f′(t)+3f(t)$, where the highest order derivative (p.220) (here, second derivative) is put to the left-hand side (this is what is being modelled), and the right-hand side involves lower order derivatives (here, the function itself and the first derivative). A second-order differential equation needs to be supplemented with two initial conditions, one for the function and one for its derivative, e.g. $f0=1$ and $f′0=7$. As an exercise, the reader may wish to verify that the solution to this problem is $f(t)=2e3t−e−t$. To do so, one needs to check that the proposed function $f(t)$ satisfies both of the initial conditions as well as the differential equation. For the latter, simply compute the value of the left-hand side $f″(t)$ and the value of the right-hand side $2f′(t)+3f(t)$, and note that they are indeed the same.

## A.2.3 Systems of differential equations

Many ecological models are written as coupled systems of differential equations. Examples involve e.g. multi-species models (say, predator–prey model, with one equation for each species) and metapopulation models (with one equation for each patch). The mathematically simplest version is that of a linear system of autonomous differential equations. With the help of the matrix notation introduced in Appendix A.1, such a system can be written as $y′(t)=Ay(t)$, where the variable $t$ is usually time. The equation needs to be supplemented with an initial condition, which can be written as $y(t)=y0$. This is a linear system because $Ay(t)$ is a linear function of $y$. It is an autonomous system because $Ay(t)$ depends on the variable $t$ only through $y(t)$, i.e. the matrix $A$ is independent of time.

Autonomous linear systems of differential equations can be solved with a standard procedure. This consists of making an ‘ansatz’ (an educated guess which will later be verified to hold) that the solution is of the form $y(t)=xeλt$. So let us substitute this ansatz to the differential equation $y′(t)=Ay(t)$ and see what happens. The derivative of the ansatz is given by $y′(t)=λxeλt$, and the matrix product is given by $Ay(t)=Axeλt$. Equating these two gives $λxeλt=Axeλt$. This will be true for all times $t$ if and only if $Ax=λx$. Thus, we have shown that the ansatz $y(t)=xeλt$ is a solution to the differential equation, if $(λ,x)$ is an eigenvalue–eigenvector pair of the matrix $A$ (see Appendix A.1). In general, if $A$ is a $n×n$ matrix, it will have $n$ eigenvalue–eigenvector pairs, which we may denote by $(λi,xi)$ for $i=1,…,n$. As the system of differential equations is linear, not only all the $n$ functions $yi(t)=xieλit$ satisfy the differential equations, but also any linear combination $y(t)=∑i=1ncixieλit$. The initial condition $y(0)=y0$ is needed to determine the weights $ci$ of the linear combination. As for $t=0$ it holds that $eλit=1$, we obtain $y0=∑i=1ncixi$. Denoting by $X$ the $n×n$ matrix which has the eigenvectors $xi$ as its columns, we can write the initial condition in matrix notation as $Xc=y0$, where $c$ is the $n×1$ column vector with elements $ci$. The solution can be written $c=X−1y0$, where $X−1$ is the inverse of the matrix $X$ (see Appendix A.1).

As an example, let $A$ be the $2×2$ matrix

(A.8)
$Display mathematics$

and let $y0=(−1,1)T$. Denoting the two components of the vector $y(t)$ by $y1(t)$ and $y2(t)$, the system of differential equations can also be written as (see Appendix A.1)

(A.9)
$Display mathematics$

with the initial condition $y10=−1$, $y2(t)=1$. The eigenvalues of the matrix $A$ are $λ1=2i$, $λ2=−2i$, where $i$ is the imaginary unit with $i2=−1$. It may seem confusing that the eigenvalues are complex numbers, as the original system of differential equations had nothing to do (p.221) with complex numbers! But there is nothing wrong here, the complexity of the eigenvalues just indicates that the solution will oscillate in time (this relates to what is called Euler’s formula: $eia=cosa+isina$). After computing the eigenvalue–eigenvector pairs and solving $c=X−1y0$, the solution can be presented as

(A.10)
$Display mathematics$

Reassuringly, the imaginary parts have cancelled out from the solution, and thus we needed complex numbers only as an intermediate step. The dynamics of this system are illustrated in Figure A.3. They resemble predatory–prey dynamics discussed in Section 4.2.

Figure A.3 The solution (Eq. (A.10)) to the system of differential equations given by Eq. (A.9). The function $y1(t)$ is shown by continuous line and the function $y2(t)$ by dashed line.

Systems of differential equations used to model ecological or evolutionary dynamics will often be non-linear and non-autonomous. For the general case, a system of first-order differential equations can be written as $y′(t)=f(y(t),t)$, where $f$ is some vector-valued function which may depend in any non-linear manner on the current state of the system $y(t)$, as well as on time $t$. Unlike for linear and autonomous systems, there is no general recipe on how to solve this equation, and one often needs to resort to numerical solutions. However, specific properties of the system (most importantly, local stability of equilibriums solutions) can often be found analytically by linearizing the system. We refer the interested reader to the many textbooks covering these topics (e.g. Swokowski 2000).

## A.2.4 Partial differential equations

Sometimes differential equations involve derivatives both with respect to time and space. In such a case, they are called partial differential equations, to be separated from ordinary differential equations which involve derivatives only with respect to one variable. As an example, consider the diffusion equation (Eq. (2.2)) of Chapter 2, simplified here to a single spatial dimension as

(A.11)
$Display mathematics$

(p.222) The state variable in this equation, denoted by $f(x,t)$, is a function of both time $t$ and spatial location $x$. In the context of Chapter 2, $f(x,t)$ would model the probability density of an individual’s location. The symbol $∂$ that appears on both sides of the equation stands for a partial derivative. In the left-hand side the partial derivative is written as $∂/∂t$, and it measures the rate of change with respect to time $t$. Thus, the left-hand side of the equation asks how fast the function $f$ increases or decreases with time. The right-hand side of the equation gives the answer: the rate of change is proportional to the second spatial derivative. Note that in the right-hand side, the partial derivative is taken not with respect to time $t$ but with respect to the spatial variable $x$, and it is not the first but the second partial derivative, as indicated by the superscripts 2 in $∂2/∂x2$. Based on the interpretation of the second derivative, Eq. (A.11) says that the function $f$ will increase in locations where it currently has a smaller value than in neighbouring locations, and that it will decrease in locations where it currently has a higher value than in neighbouring locations. Therefore, the function becomes smoother than it was originally, as peaks become flattened out and dips become filled.

A partial differential equation also needs to be supplemented by an initial condition $fx,0=f0(x)$. Unlike with ordinary differential equations, the initial condition is not a single value, but a function $f0(x)$ which describes the initial state of the function $f$ in all spatial locations $x$.

## A.2.5 Difference equations

The difference between differential and difference equations is that difference equations are defined for discrete time, $t=0,1,2,…,$ while differential equations are defined with respect to continuous time. For example, a discrete-time analogy for the differential equation $f′(t)=2f(t)$ would be the difference equation $ft+1=2f(t)$. This equation tells that every time-step the number of individuals in the population doubles. Starting again with a single individual, $f0=1$, the solution to this problem is $f(t)=2t$. Like the solution of the corresponding differential equation, this also corresponds to exponential growth. As in this book we mainly focus on continuous-time models, we have focused on differential equations, but very similar mathematical tools exist also for difference equations. Often the same biological problem can be modelled almost identically by a differential or a difference equation, and thus the choice between these two is in many cases somewhat arbitrary. However, in other cases these two approaches can give qualitatively different answers.

# A.3 A very brief tutorial to random variables

Differential equations and difference equations (Section A.2) are examples of deterministic models. This means that ‘running’ the same model repeatedly always leads to the same solution. Many, if not all, ecological and evolutionary phenomena are, however, not of deterministic but of stochastic nature, meaning that running exactly the same model twice (e.g. repeating the same experiment twice) usually results in different outcomes. Such random variation can be captured by stochastic processes. A central concept behind stochastic processes is that of a random variable, a concept that is also needed for understanding the mathematical foundations of statistics that we discuss in Appendix B.

## A.3.1 Discrete valued random variables

Random variables are usually denoted by capital letters such as $X$. An example of a random variable $X$ could be the outcome of a toss of a coin or a die. In the case of the die, there are (p.223) six possible outcomes, $X∈(x1,x2,…,x6)$, which simply correspond to the number of dots (called pips) on the six faces of the die. Thus, in this example we may write $x1=1,…,x6=6$, or in a more general notation $xk=k$, where $k=1,2,3,4,5,6$. Assuming a fair (i.e. unbiased) die, the probabilities of each of these outcomes are identical, so that $pk=1/6$ for $k=1,…,6$. $X$ is a discrete valued random variable, as the outcomes form a discrete set. The probability distribution of $X$ can be described simply by listing the probabilities of all possible outcomes, as we have just done.

One of the most basic properties of a random variable is its expectation, denoted by $E(X)$. The expected value is defined as the mean value of the outcomes, weighted by their probabilities, $E(X)=∑kpkxk$. For the die example, the expected result, also called the mean or the average, is $E(X)=16+26+⋯+66=72=3.5$. If throwing the die 1,000 times, the sum of the values should be close to 3,500. In addition to the expectation, another routinely used statistic is variance, which measures the amount of variability among the outcomes. Variance is defined as $Var(X)=EX−E(X)2=∑kpkxk−E(X)2$. In this formula, $X−E(X)$ asks how far the random variable $X$ is from its expectation $E(X)$. Variance is thus defined as the expectation of the squared difference between the random variable and its expectation. In the case of the die example, we obtain

$Display mathematics$

Random variables can have infinitely many possible outcomes. As one example, let us consider a random variable for which the possible outcomes are all the non-negative integers $k=0,1,2,….$ Let us assume that the first outcome ($k=0$) takes place every second time, i.e. with probability $p0=1/2$. Let us assume that the second outcome ($k=1$) takes place in every second case, conditional on $k=0$ not taking place, and thus with probability $p1=1/2×1/2=1/4$. Assuming that the third possible outcome ($k=2$) would again hold every second time, conditional on $k=0$ or $k=1$ not taking place, we have $p2=1/2×1/4=1/8$. Continuing similarly, we obtain for the general case $pk=1/2k+1$. The probabilities $pk$ are illustrated in Figure A.4A. They describe the probability density function of a random variable, which follows the geometric distribution with parameter value 1/2. For any discrete probability distribution such as the geometric distribution, summing up the probabilities of all possible outcomes gives one, i.e. $∑kpk=1$.

Figure A.4 Representing random variables through probability distributions. (A, B) illustrate a discrete valued random variable (geometric distribution with parameter 1/2), whereas (C, D) illustrate a continuous valued random variable (normal distribution with mean 0 and variance 1). (A, C) show probability density functions and (B, D) cumulative distribution functions.

Applying the formulae of expectation and variance described two paragraphs earlier, we find that these are for the geometrically distributed (with parameter ½) random variable $E(X)=∑kpkxk=∑kk/2k+1=1$, and $Var(X)=∑kpkxk−E(X)2=∑kk−12/2k+1=2$. We note that for this distribution, we were able to compute the values of the infinite sums exactly, but in a more general case one may need to compute them numerically.

## A.3.2 Continuous valued random variables

Let us then move to random variables with a continuous state-space. An archetypal example of such a random variable is given by the normal distribution $N(μ,σ2)$ with mean $μ$ and variance $σ2$. The notation $X∼N(μ,σ2)$ is to be read as ‘the random variable $X$ is distributed according to the normal distribution with mean $μ$ and variance $σ2$’. Let us consider for simplicity the so-called standard normal distribution $N(0,1)$, with mean 0 and variance 1. The random variable $X$ can obtain any real value. Thus, if we sample once from the distribution, we may obtain $X=0.2311$, whereas the next sample may be $X=−24.5921$. While any value is possible, not all values are equally likely. The relative probabilities of different values are described by the probability density function of the distribution. In the case of $N(0,1)$, the (p.224) probability density function (illustrated in Figure A.4C) is $p(x)=e−x2/2/2π$. The value of the probability density at the most likely value of $x=0$ is $p(0)=1/2π≈0.40$, whereas a value as far from 0 as $−24.5921$ is extremely unlikely, $p−24.5921≈1.9×10−132$.

In the case of a discrete valued random variable, the values of the probability density function are probabilities which sum to 1. But in the case of a continuous random variable, there is a continuum of possible values. Thus, we cannot say that the most likely value $x=0$ is sampled with probability 0.4. Even if the sampled value could be $x=0$, mathematically the probability of sampling this value is 0, as is the probability of sampling any specific single value. With continuous valued random variables, we may ask with what probability the random variable will fall within a specific range. The answer to this question is obtained by integrating the probability density function over that range. For example, as the area under the probability density curve within the range $−1 is $∫−11p(x)dx=∫−11(e−x2/2/2π)dx≈0.683$, the fraction of cases where the random variable sampled from $N(0,1)$ will have its value within this range is 68.3%.

The probability by which the random variable will obtain a value that is at most a given threshold $x$ is called the cumulative distribution function, often denoted by $F(x)$. (p.225) For continuous random variables, $F(x)$ is defined by $F(x)=∫−∞xp(y)dy$, whereas for discrete random variables it is defined by $F(x)=∑y=−∞xpy$. Note that inside the integral (or sum) we have used another name for the variable ($y$) because $x$ is reserved for the integration (summation) limit.

In the case of discrete valued random variables, the probabilities must sum to 1. Similarly, in the case of a continuous valued random variable, the probability density function needs to integrate to 1, i.e. $∫−∞∞p(x)dx=1$. These statements can be expressed with the help of the cumulative distribution function as $F(∞)=1$. Figure A.4 illustrates how the cumulative distribution functions of the geometric and the normal distributions converge to 1 as $x$ increases. Note that the probability density function of the standard normal distribution involves a factor $1/2π$. This factor is called the normalizing constant, as it has been chosen so that $F(∞)=1$.

The expected value of a continuous valued random variable is defined like that for a discrete valued random variable, except replacing the summation by an integral. Thus, $E(X)=∫−∞∞p(x)xdx$, where each possible value $x$ of the random variable $X$ is weighted by its probability density $p(x)$. For $X∼N(0,1)$, it holds that $E(X)=∫−∞∞(e−x2/2/2π)xdx=0$. Similarly, the variance of a continuous valued random variable is defined by $Var(X)=∫−∞∞p(x)x−E(X)2dx$, yielding for the $X∼N(0,1)$ $Var(X)=∫−∞∞(e−x2/2/2π)x2dx=1$. Note that we derived that the mean and variance of a $N0,1$ random variable are indeed 0 and 1.

## A.3.3 Joint distribution of two or more random variables

Often it is interesting to consider two or more random variables at the same time. Denoting two random variables by $X$ and $Y$, we may consider their joint distribution, i.e. the probabilities (or values of the probability density) of all possible combinations of their outcomes $(x,y)$. Covariance and correlation measure the degree by which the two random variables depend on each other. The covariance is defined as $Cov(X,Y)=E[(X−E(X))(Y−E(Y))]$, and correlation is obtained by scaling the covariance by the variances, $Cor(X,Y)=Cov(X,Y)/Var(X)Var(Y)$. While covariance can obtain any value from $−∞$ to $+∞$, correlation is restricted to the range from $−1$ to 1. Two variables are statistically independent if their covariance, and thus correlation, is 0.

To illustrate the concept of covariance, let us consider the multivariate normal distribution. The notation $X∼N(μ,Σ)$ means that the $n×1$ random vector $X$ follows a multivariate normal distribution with mean $μ$ (a $n×1$ vector) and variance-covariance matrix $Σ$ (a $n×n$ matrix). As an example, let us consider the bivariate case ($n=2$) with

(A.12)
$Display mathematics$

In this example, the vector $X$ has two elements, $X=X1X2T$. The first one of these is distributed as $N(−2,4)$ and the second one as $N(1,25)$, where the means and the variances of the univariate normal distributions are obtained from the mean vector $μ$ and the diagonal elements of the variance-covariance matrix $Σ$. The covariance 7 between the two random variables can be translated into the correlation of $7/4×25=0.7$. The dots in Figure A.5A illustrate 200 random deviates sampled from this distribution, and the ellipses the 50 and 95% quantiles of the probability distribution. The mean of the distribution is centred at $μ$, and the positive correlation assumed between the two variables is evident in the figure. Figure A.5B illustrates otherwise the same case but the covariance has been changed to $−7$, leading to a negative correlation of $−0.7$.

Figure A.5 Illustration of a bivariate probability distribution. In (A), the dots show 200 random deviates sampled from the bivariate normal distribution of Eq. (A.12), whereas the lines show the 50 and 95% quantiles of this distribution. (B) is otherwise identical but the covariance has been changed to $−7$.

## (p.226) A.3.4 Sums of random variables

Expectations, variances, and covariances have many useful properties. For example, one may wish to study the properties of a random variable $Z$ obtained as a weighted sum $Z=aX+bY$ of two random variables $X$ and $Y$, where the weights $a$ and $b$ are real numbers. The expectation and variance of $Z$ can be computed from those of the original variables as $E(Z)=aE(X)+bE(Y)$, and $Var(Z)=a2Var(X)+b2Var(Y)+2abCov(X,Y)$. Similarly, if $Z=aX+bY$ and $W=cP+dQ$, then

(A.13)
$Display mathematics$

If the random variables $X$ and $Y$ are independent of each other, the covariance is 0, and thus the formula for the variance simplifies to

(A.14)
$Display mathematics$

In addition to expectation and variance, it is possible to compute the entire probability distribution for the (possibly weighted) sum of two random variables. To illustrate, let us consider the example of throwing the die (see Section A.3.1) but now doing it twice, with the first and second outcomes called $X$ and $Y$. The sum of these two outcomes $Z=X+Y$ may obtain any value between 2 and 12. For example, the value of 5 will be obtained by the $(X,Y)$ combinations $(1,4)$, $(2,3)$, $(3,2)$ and $(4,1)$. As each of these takes place with probability 1/36, we obtain $PrZ=5=4/36=1/9$. More generally, if we denote by $pX$ and $pY$ the probability distributions of the random variables $X$ and $Y$ (which in the die example are identically distributed, but that more generally could be different), we have

(A.15)
$Display mathematics$

Equation (A.15) is simply based on going through all the cases for which the sum of the two random variables obtains the specific value $z$. This happens if the first variable $X$ obtains any (p.227) value $k$, and the second variable $Y$ obtains the value $z−k$, which makes the sum equal $z$, as we just illustrated for the die example for $z=5$.

We note that Eq. (A.15) holds only if the random variables $X$ and $Y$ are independent of each other. To see this, let us not throw the second die at all, but decide to set the value of the second die to the value of the first die. In this case, the outcome of the second die $Y$ is still a random variable, as it obtains each of the values from 1 to 6 with probability 1/6. However, it is not independent from the outcome of the first die $X$. Now the possible outcomes of $Z$ are 2, 4, 6, 8, 10, and 12, and each of these takes place with probability 1/6.

Equation (A.15) holds also for continuous valued random variables if replacing the sum by an integral. Thus assuming that $X$ and $Y$ are independent continuous valued random variables, the probability density function of $Z=X+Y$ is given by

(A.16)
$Display mathematics$

where the symbol $∗$ denotes convolution (see Section A.2). As an example, let $X∼Uniform(0,1)$ be an uniformly distributed random variable in the range from 0 to 1. Let also the random variable $Y∼Uniform(0,1)$ follow the same distribution. Evidently, the sum $Z=X+Y$ may obtain any value from 0 to 2. At first sight one might expect that $Z$ could obtain any value from 0 to 2 equally likely, i.e. that $Z∼Uniform(0,2)$. This would actually be the case if we would make the two random variables fully correlated by sampling only $X$ and then setting $Y=X$, as we did in the case of the die example where we set the value of the second die to that of the first die.

To compute the distribution of $Z$ when $X$ and $Y$ are independent, let us apply Eq. (A.16). With $X∼Uniform(0,1)$, the probability density function is $pX(x)=1$ if $0 and $pX(x)=0$ otherwise. As $Y$ is identically distributed, it has the same probability density function. Thus we obtain

(A.17)
$Display mathematics$

In Eq. (A.17) we have first utilized the fact that $pX(x)=1$ for $0 and $pX(x)=0$ otherwise. We have then utilized the fact that $0 if and only if $z−1. As the integrand is just the constant 1, the value of the integral is the length of the integration interval, and therefore $pZ(z)=min(z,1)−max(z−1,0)$. Thus, if $0, we have $pZ(z)=z$. If $1, we have $pZ(z)=1−z−1=2−z$. This probability density looks like a triangle: the most likely value for the sum is $Z=1$, and the value of the probability density decreases linearly to 0 if approaching the extremes of 0 or 2. Informally, there are more combinations of $X$ and $Y$ which produce the sum of $Z=1$ than e.g. the sum of $Z=0.1$. Note that this is analogous to the earlier discussion on the sum of two throws of the die, for which case e.g. $PrZ=5=4/36$ is greater than $PrZ=2=1/36$.

A special case often needed in the context of hierarchical generalized linear models (Section B.1) is the sum of two normally distributed variables. Assume that $X∼N(μX,σX2)$ and that $Y∼N(μY,σY2)$, and let $Z=X+Y$. Assuming that $X$ and $Y$ are independent, it holds that $Z∼N(μX+μY,σX2+σY2)$. Thus, the sum of two normally distributed random variables is also normally distributed, the expected value being the sum of the expectations, and the variance being the sum of the variances.

## (p.228) A.3.5 An application of random variables to quantitative genetics

As an illustration of how to work with random variables, we follow here Appendix A of Ovaskainen et al. (2011) to derive Eq. (5.4) in the main text. Our starting point is Eq. (5.1), which models the breeding value of an individual $i$ as

(A.18)
$Display mathematics$

We consider the allelic states $xijku$ as random variables (even if they are denoted by non-capitalized letters), and the allelic effects $vju$ as fixed. By the covariance formula for sum of random variables (Eq. (A.13)), we obtain

(A.19)
$Display mathematics$

We then apply the definition of covariance to write

(A.20)
$Display mathematics$

We denote the frequency of the allele $u$ of gene $j$ in the ancestral population by $pju$. As we assume that the genes under consideration are neutral, the expected allele frequency for any individual is equal to that in the ancestral population, and thus $Exijku=pju$. Recall that $xijku$ is an indicator function taking the value 1 for one of the allelic variants $u$ and 0 for all other allelic variants. Thus, while one might at first think that $Exijkuxijku=pju2$, it actually holds that $Exijkuxijku=pju$. For the same reason, it holds that $Exijku1xijku2=0$ for $u1≠u2$. The value of the product $xi1jk1uxi2jk2u$ equals 1 if in gene $j$ the allele $k1$ of the individual $i1$ is of the same type u as the allele $k2$ of the individual $i2$, whereas the product equals 0 if this is not the case. This means that the probability by which the allele $k1$ of the individual $i1$ is of the same type u as the allele $k2$ of the individual $i2$ is given by $Exi1jk1uxi2jk2u$. Therefore, the probability that two randomly chosen alleles in a gene $j$ are of the same type $u$ for the individuals $i1$ and $i2$ is given by $∑k1,k2=12Exi1jk1uxi2jk2u/4$. This probability can be decomposed into two components. First, the alleles may be identical by descent, the ancestral allele being of type u. Given the definition of coancestry coefficients (see Section 5.2), this happens with probability $θi1i2pju$. The second option is that the alleles are not identical by descent, but that the alleles in the ancestral generation were both of type $u$. As the probability of the latter case is $(1−θi1i2)pju2$, we obtain

(A.21)
$Display mathematics$

As the individuals $i1$ and $i2$ can have different allelic variants only if the alleles are not identical by descent (recall from Section 5.2 that we have ignored mutations), we obtain for $u1≠u2$

(A.22)
$Display mathematics$

Equation (A.22) can be written as

(A.23)
$Display mathematics$

To simplify Eq. (A.23), we denote by $hju1u2=δu1u2pju1−pju1pju2$, where $δu1u2$ is Kronecker’s delta, which obtains the value of 1 if $u1=u2$ and the value of 0 if this is not the case. In this notation, we have

(A.24)
$Display mathematics$

(p.229) Substituting this expression into Eq. (A.19) yields

(A.25)
$Display mathematics$

which equals Eq. (5.4), if we measure the amount of additive variation $VA$ present in the ancestral generation by

(A.26)
$Display mathematics$

# A.4 A very brief tutorial to stochastic processes

After the discussion of random variables, we are ready to move to stochastic processes. A stochastic process is a random variable that depends on time. Thus, instead of a single random variable $X$, we consider a collection of random variables $X(t)$ indexed by time $t$. Time can be continuous or discrete, and the state-space (i.e. the set of values which the random variable can attain) can be continuous or discrete as well.

## A.4.1 Markov chains

Let us start from the discrete time setting $(t=0,1,…)$, and consider the simplest possible example in which the state space involves only two states, so that the random variable $X(t)$ has either the value 1 or the value 2. A realization of such a stochastic process may look like (1, 1, 1, 2, 2, 1, 2, 2, 1, 2, 1, …), where we have showed the values of the random variables $X(t)$ at times $t=0,1,2,….$

The simplest and most important class of this kind of models is given by Markov chains. The key feature of a Markov chain is that the probability distribution of the random variable $X(t)$ is allowed to depend only on the state of the system in the previous step, i.e. on the value of $X(t−1)$. Markov chains can be described with the help of a transition matrix $P$. The elements of this matrix, denoted by $pij$, are the probabilities that the process will next move to a state $j$ if it is currently in state $i$. As an example, let us consider the transition matrix

(A.27)
$Display mathematics$

If the system is presently in state 1, it may either stay in that state (with probability $p11=0.8$) or move to state 2 (with probability $p12=0.2$). If the system is currently in state 2, it will stay in that state with probability $p22=0.4$ and move to state 1 with probability $p21=0.6$. Figure A.6A shows a simulation of this Markov chain model. As the transition matrix is set up so that state 1 is favoured, it is intuitive that the system spends more time in that state than in state 2.

Figure A.6 Illustrations of Markov chains and Markov processes. The upper panels show stochastic simulations of the discrete time Markov chain (A) and the continuous time Markov process (B) of Section A.4. (C, D) show how the probability of being in a given state evolves in time. In (C), the Markov chain is initiated from state 1 at time $t=0$, and the black (respectively, grey) dots show how the probability of being in state 1 (respectively, in state 2) evolves over time. In (D), the Markov process is initiated from state 1 at time $t=0$, and the lines show how the probabilities of being in state 1 (continuous line), state 2 (dashed line), and state 3 (dotted line) evolve over time.

Markov chains are mathematically convenient since much of their behaviour can be derived with the help of matrix algebra. Let us denote by $z(t)=(z1(t),z2(t))$ a row vector that describes the probability distribution of the systems state at time $t$, coded so that $z1(t)$ is the probability that the system is in state 1 and $z2(t)=1−z1(t)$ is the probability that the system is in state 2. At each time step, the probability distribution $z(t)$ changes according to the rules of the transition matrix: the probability that the system will be in state 1 at time $t+1$ is the probability that it was at time $t$ in state 1 and stays there, plus the probability that it was in state 2 and will move to state 1. As an equation, $z1t+1=z1(t)p11+z2(t)p21$. This equation and the analogous equation for $z2t+1$ can be combined in the more compact matrix notation (see Appendix A.1) as $zt+1=z(t)P$. Thus, the probability distribution $zt+1$ is obtained by multiplying the probability distribution $z(t)$ by the transition matrix $P$ from the right.

(p.230) As an example, let us assume that the system is initially in state 1 and thus $z1=(1,0)$. Then we may compute that $z1=(0.8,0.2)$, $z2=(0.76,0.24)$, and $z3=(0.752,0.248)$, which numbers are illustrated in Figure A.6C. We may use matrix algebra to jump directly over multiple time steps. For example, $zt+2=zt+1P=z(t)P2$, where $P2$ means that the matrix $P$ is raised to the second power and thus multiplied by itself in the sense of matrix multiplication. More generally, $zt+k=z(t)Pk$, and thus $Pk$ gives the transition matrix describing how the probability distribution changes over $k$ time steps.

The long-term behaviour of a Markov chain is characterized by its stationary state. In the example in the previous paragraph, the probability distribution $z(t)$ approaches $z∗=(0.75,0.25)$ as time $t$ increases. Mathematically, this can be written as $limt→∞z(t)=z∗$. Thus, if sampling the system long enough after the process started, the probability to find it in state 1 will be 0.75 and the probability to find it in state 2 will be 0.25. This means that the system will (p.231) spend 75% of its time in state 1 and 25% of its time in state 2. The stationary state $z∗$ can be found analytically by noting that it must satisfy the equation $z∗=z∗P$. As discussed in Appendix A.1, this means that $z∗$ is a left eigenvector of $P$ associated with the eigenvalue 1. Solving the eigenvector problem with standard matrix algebra tools yields $z∗=(0.75,0.25)$.

Note that the probabilities $z1,z2$, and $z3$ listed earlier approach, but do not equal, the stationary state $z∗$ because it matters from which state the system is initiated. During the initial transient period, the system ‘forgets’ from which state it was initiated and it converges to the stationary state that is independent of the initial condition. In other words,

(A.28)
$Display mathematics$

showing that whichever was the initial condition, the system will after a sufficiently long time be in state 1 or in state 2 with probabilities 0.75 and 0.25, respectively.

## A.4.2 Markov processes

The continuous time analogue of a Markov chain is a Markov process. While in a Markov chain the system stays a fixed time period in each state before moving to another state (or possibly staying in the same one), in a Markov process the durations of the time steps are exponentially distributed random variables. A Markov process is determined by specifying its transition rates.

Before continuing with the Markov process, let us discuss the relationship between rates and probabilities. Unlike probabilities, rates are not constrained to be between 0 and 1 but they can have any positive values. The probability $p$ by which an event with a rate $q$ takes place during a short time interval $dt$ is given by $p=qdt$. This equation holds only for very small time intervals, so that technically it holds only for $dt→0$. To see this, consider e.g. the rate $q=2$, and let the duration of the time interval be $dt=1$. The formula $p=qdt$ would predict that the probability by which the transition takes place is 2, which clearly cannot be valid. For an arbitrary time interval $Δt$ of any duration, the probability of the event happening is $p=1−e−qΔt$, which gives the probability of $p=0.85$ for $q=2$ and $dt=1$. If assuming a short time interval, the linear relationship $qΔt$ becomes an increasingly good approximation. For example, if $q=2$ and $Δt=0.01$, the approximation $qΔt=0.02$ is very close to the exact value of $p=1−e−qΔt=0.0198$.

Let us then illustrate a Markov process with an example system with three states. We let the transition rate of moving from state 1 to 2 be $q12=2$, and set the other rates to $q13=1$, $q21=0$, $q23=1$, $q31=1$, and $q32=1$. Markov processes can be simulated with the help of the Gillespie (1977) algorithm, the description of which also illustrates mathematical properties of Markov processes. Let us assume that the process is currently in state 1. As the process leaves this state with rates $q12=2$ (to state 2) and $q13=1$ (to state 3), the total rate at which the process leaves the state 1 is $q12+q13=3$. Mathematical theory of Markov processes now tells us that the time that the system stays in state 1 is exponentially distributed with parameter equalling the total rate 3. The expected value of the exponential distribution is the inverse of the parameter (in this case $1/3$), so that a large rate means a short waiting time. When simulating the process, we may thus randomize the duration $Δt$ that the system spends in the present state using an exponentially distributed random number. The next decision to be made in the simulation is to which state the system moves when it leaves the present state 1. This is determined by the relative rates. By dividing the rates $q12=2$ and $q13=1$ by their sum, we see that the system moves to state 2 with probability $2/3$ and to the state 2 with probability $1/3$. Figure A.6B illustrates a simulation of the Markov process just described.

(p.232) Also Markov processes can be studied analytically with the help of matrix algebra. The analogue of the transition probability matrix $P$ is the transition rate matrix $Q$. In the example in the previous paragraph, we have

(A.29)
$Display mathematics$

Here the off-diagonal elements have been set to the transition rates, whereas the diagonal elements have been set so that the row sums equal to 0. This is in contrast with the transition probability matrix $P$, in which the row sums equal to 1.

While in the case of the discrete time Markov chain the state of the system evolves according to the difference equation $zt+1=z(t)P$, in the case of the continuous time Markov process it evolves according to the system of differential equations

(A.30)
$Display mathematics$

Figure A.6D illustrates the solution to this system of differential equations, thus showing how the probability of being in each of the three states behaves if starting initially from state 1. After long enough time, the system converges to the stationary state. At the stationary state the probability distribution does not change anymore, and thus its derivative is 0, meaning that the stationary probability distribution $z∗$ must satisfy the equation $0=z∗Q$. Thus, $z∗$ is the left eigenvector of the matrix $Q$ corresponding to the eigenvalue 0. For our example, we obtain , reflecting the amount of time that the system spends in the three states at the stationary state.

The transition rate matrix $Q$ can be translated to a transition probability matrix with the help of the equation $P(Δt)=eΔtQ$, where $Δt$ is the time interval over which the transition probability matrix is to be computed. Here $eΔtQ$ is the matrix exponential of the matrix $ΔtQ$. It is defined, analogously to the Taylor expansion of the usual exponential function, as the power series $eΔtQ=I+12!Δt2Q2+13!Δt3Q3+…,$ where I stands for the identity matrix. We note that $limΔt→0P(Δt)=I$, simply confirming that over a vanishingly short time the system will stay where it was originally.

Sometimes Markov processes or Markov chains have an absorbing state. For example, consider the transition rate matrix

(A.31)
$Display mathematics$

With this transition rate matrix, the system will never leave from state 2. Whether the system is initiated from state 1, 2, or 3, it will sooner or later enter state 2. For this reason, state 2 is called an absorbing state, and the stationary distribution is concentrated in this state, $z∗=010T$. In population models without the possibility of immigration from outside, population extinction will represent an absorbing state. In such a case, it is often of interest to study the quasi-stationary distribution, which describes the probability distribution after long enough time so that the initial condition has become irrelevant, but before the system has reached the absorbing state. Further discussion about quasi-stationary distributions can be found e.g. from Darroch and Seneta (1967), and an application in the ecological context e.g. from Ovaskainen (2001).