 Andy Hector

Print publication date: 2015

Print ISBN-13: 9780198729051

Published to Oxford Scholarship Online: March 2015

DOI: 10.1093/acprof:oso/9780198729051.001.0001 Show Summary Details
Page of

PRINTED FROM OXFORD SCHOLARSHIP ONLINE (www.oxfordscholarship.com). (c) Copyright Oxford University Press, 2019. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a monograph in OSO for personal use (for details see www.oxfordscholarship.com/page/privacy-policy).date: 19 June 2019

Comparing Groups: Analysis of Variance

Chapter:
(p.7) 2 Comparing Groups: Analysis of Variance
Source:
The New Statistics with R
Publisher:
Oxford University Press
DOI:10.1093/acprof:oso/9780198729051.003.0002

Abstract and Keywords

Analysis of variance is introduced using a worked analysis of some data collected by Charles Darwin on inbreeding depression in experimental maize plants. ANOVA is used to compare the height of a group of cross-pollinated plants with a group of self-fertilized seedlings. The least squares process is explained including the calculation of sums of squares and variances and the calculation of F-ratios. The organization of the results of a linear model into an ANOVA table is explained. The R code necessary to perform the analysis is worked through and explained in terms of the underlying statistical concepts and the interpretation of the R output demonstrated.

The New Statistics with R. Andy Hector. © Andy Hector 2015.

Published 2015 by Oxford University Press.

2.1 Introduction

Inbreeding depression is an important issue in the conservation of species that have lost genetic diversity due to a decline in their populations as a result of over-exploitation, habitat fragmentation, or other causes. We begin with some data on this topic collected by Charles Darwin. In The effects of cross and self-fertilisation in the vegetable kingdom, published in 1876, Darwin describes how he produced seeds of maize (Zea mays) that were fertilized with pollen from the same individual or from a different plant. Pairs of seeds taken from self-fertilized and cross-pollinated plants were then germinated in pots and the height of the young seedlings measured as a surrogate for their evolutionary fitness. Darwin wanted to know whether inbreeding reduced the fitness of the selfed plants. Darwin asked his cousin Francis Galton—a polymath and early statistician famous for ‘regression to the mean’ (not to mention the silent dog whistle!)—for advice on the analysis. At that time, Galton could only lament that, ‘The determination of the variability . . . is a problem of more delicacy than that of determining the means, and I doubt, after making many trials whether it is possible to derive useful conclusions from these few observations. We ought to have measurements of at least fifty plants in each case’. Luckily we can now address this question using any one of several closely related (p.8) linear model analyses. In this chapter we will use the analysis of variance (ANOVA) originally developed by Ronald Fisher (Box 2.1) and in Chapter 3 Student’s t-test.

The focus of this book is statistical analysis using R, not the R programming language itself (see the Appendix). R is therefore introduced relatively briefly, and if you are completely new to the language you will need to read the introductory material in the Appendix first and refer back to it as needed, together with the R help files. You should also explore the wealth of other information on R recommended there and available via the web.

In R we can get Darwin’s data (Box 2.2) from an add-on package called SMPracticals after installing it from the Comprehensive R Archive Network website (together with any other packages it is dependent on) and activating it using the library() function. Notice the use of the hash symbol to add comments to the R code to help guide others through your analysis and remind yourself what the R script does:

• > install.packages(SMPracticals, dependencies= TRUE)> # install package from CRAN website
• > library(SMPracticals) # activates package for use in R
• > darwin # shows the data on screen
(p.9)
 pot pair type height 1 I 1 Cross 23.500 2 I 1 Self 17.375 3 I 2 Cross 12.000 4 I 2 Self 20.375 5 I 3 Cross 21.000 6 I 3 Self 20.000 7 II 4 Cross 22.000 8 II 4 Self 20.000 9 II 5 Cross 19.125 10 II 5 Self 18.375 11 II 6 Cross 21.500 12 II 6 Self 18.625 13 III 7 Cross 22.125 14 III 7 Self 18.625 15 III 8 Cross 20.375 16 III 8 Self 15.250 17 III 9 Cross 18.250 18 III 9 Self 16.500 19 III 10 Cross 21.625 20 III 10 Self 18.000 21 III 11 Cross 23.250 22 III 11 Self 16.250 23 IV 12 Cross 21.000 24 IV 12 Self 18.000 25 IV 13 Cross 22.125 26 IV 13 Self 12.750 27 IV 14 Cross 23.000 28 IV 14 Self 15.500 29 IV 15 Cross 12.000 30 IV 15 Self 18.000

(p.10) A good place to start is usually by plotting the data in a way that makes sense in terms of our question—in this case by plotting the data divided into the crossed and selfed groups (Fig. 2.1). R has some graphical functions that come as part of the packages that are automatically installed along with the so-called base R installation when you download it from the CRAN website. However, I am going to take the opportunity to also introduce Hadley Wickham’s ggplot2 (Grammar of Graphics, version 2) package that is widely used throughout this book. While ggplot2 has an all-singing all-dancing ggplot() function it also contains a handy qplot() function for quickly producing relatively simple plots (and which will take you a surprisingly long way). One advantage of this qplot() function is that its syntax is very similar to that of the base R graphics functions and other widely used R graphics packages such as Deepayan Sarkar’s Lattice. Luckily, ggplot2 is supported by a comprehensive website and book so it is easy to expand on the brief introduction and explanations given here. If you do not have the ggplot2 package on your computer you can get it by rerunning the install.packages() function given earlier but substituting ggplot2 in place of SMPracticals. Notice that the qplot() function has a data argument, and one restriction when using ggplot2 is that everything we want to use for (p.11) the plot must be in a single data frame (in this case everything we need is present in the darwin data frame but if it were not we would have to create a new data frame that contained everything we want to use for our graphic):

• > library(ggplot2) # activate package
• > qplot(type, height, data= darwin, shape= type, colour=> type)+theme_bw()
• > ggsave(file= "Hector_Fig2-1.pdf") # save graph as pdf Figure 2.1 The height of Darwin’s maize plants (in inches) plotted as a function of the cross- and self-pollinated treatment types. Notice how easy it is with ggplot2 to distinguish treatments with different symbol types, colours (seen as different shades of grey when colour is not available), or both and how a key is automatically generated.

Some of the advantages of ggplot2 over the base graphic functions are immediately obvious in how simple it is to use different symbol shapes, colours, and backgrounds (deleting the theme_bw() command for the black and white background will reveal the default theme_grey() setting that is handy when using colours like yellow that do not show up well against white) together with an automatically generated key with a legend. Even better, notice how easy it is to save the file (in various types, sizes, or rescalings) using the handy ggsave() function.

Figure 2.1 suggests that the mean height may be greater for the crossed plants, which would be consistent with a negative effect of inbreeding. But how confident can we be in this apparent signal of inbreeding depression given the level of noise created by the variation within groups? The variability seems reasonably similar in the two groups except that the crossed group has a negative value lying apart from the others with a height of 12—a potential outlier. Actually, as we will see in Fig. 2.2, there are two values plotted on top of one another at this point (see the ggplot2 website and online supplementary R script at <http://www.plantecol.org/> for the use of geom(jitter) as an additional argument to the qplot() function that deals with this issue). This is typical of many data. The outlying negative heights could be due to attack by a pest or disease or because somebody dropped the pot, accidentally damaged the plant, or simply took the measurement incorrectly. It is hard to say anything more specific using this eyeball test since the difference between treatment groups is not that dramatic and there is a reasonable degree of variability within groups, not to mention (p.12) those troublesome outliers. The ANOVA will quantify this signal-to-noise ratio and how confident we can be that we are not being fooled by a false positive: an apparent difference where one does not really exist (a ‘type I error’). The larger the difference relative to the variability the more confident we can be.

2.2 Least squares

The general strategy of ANOVA is to quantify the overall variability in the data set and then to divide it into the variability between and within the self- and cross-pollinated groups and to calculate a signal-to-noise ratio. The greater the ratio of the signal to the noise the more confident we can be that we have detected a real effect. To quantify the variability we can use a method called least squares that begins by measuring the differences from the individual data points to some reference point. The most intuitive one is the overall grand mean, but any value will do and as we will see later statistical software usually chooses a different benchmark. The method is probably better illustrated in the context of regression (see Chapter 4) but the name conveys the idea that the analysis finds the lines of best fit, in this case the treatment-level means, by minimizing the overall squared distances to the individual data points. Of course, we cannot simply add up the differences from each data point to the mean and use this as a measure of variability because the negative differences will always cancel the positive ones and consistently result in a value of zero! We could take the absolute values, but early statistics instead solved this problem by squaring the differences so that the negative values become positive. The resulting value—the sum of the squared differences, or sum of squares (SS) for short—is our measure of the overall variability in the data.

Figure 2.2 plots the heights of the plants together with the overall mean and the difference between each value and this average. To save space, the R code for this figure is presented in the online supplementary R script at <http://www.plantecol.org/>. However, as indicated by its name, ANOVA works with another quantity, the variance (usually symbolized as sigma (p.13) squared, σ‎2). The variance is also called the mean square (MS) because it is an average amount of variation: it might be useful to think of it loosely as a per data point average amount of variation. I say ‘loosely’ because the SS is actually averaged using a mysterious quantity called the degrees of freedom (DF; Box 2.3). The total number of DF is one less than the number of data points (30 – 1 = 29). Figure 2.2 How to calculate the total (SST), treatment (SSA), and error (SSE) sums of squares. In each panel the vertical lines measure the differences that are then squared and summed. The SST (left) is calculated by measuring the differences from each data point to some reference point—the overall mean is the most intuitive one for teaching purposes as shown by the horizontal line (although for technical reasons it is generally not the one used by statistical software!). The differences for the SSA (middle) are between the treatment-level means (the horizontal lines of ‘fitted values’ shown by the crosses and circles) and the grand mean. The differences for the SSE (right) are between the observed data values and the treatment-level means.

However, there’s a catch: because of the squaring necessary to calculate the SS this estimate of the variability is not on the same scale as the original (p.14) data. Instead, we have a measure of variation on the squared scale: that is not inches but square inches! So, if we used the SS to compare the variability in Darwin’s sample with the mean value we’d be comparing average height in inches with the variability in square inches—something we would associate more naturally with an area rather than a height. Luckily, this problem is easily solved by reversing the earlier squaring: by taking the square root of the variance we get the standard deviation (SD, S, or sigma, σ‎) which is a measure of variability on the same scale as the original data and which can be conveniently compared with the mean (or some other measure of central tendency). R has functions for calculating the mean and the SD (notice in the code the use of the dollar sign to separate the names of the data frame from the column of interest and that putting the command inside parentheses causes the result to be printed):

• > mean(darwin$height) •  18.88333 • > sd(darwin$height)
•  3.180953

The SD is a descriptive statistic that quantifies the variability in the heights of the 30 plants and which is on the same scale as the original measurements and the mean. It can be thought of as the average distance (p.15) between the mean and a data point. We will look at it again later in Chapter 5. R also has functions for the variance and for taking square roots that we can use to check the relationship between the SD and the variance:

• > var(darwin$height) •  10.11846 • > sqrt( var(darwin$height))
•  3.180953

To recap, we have seen how the least squares process minimizes the SS to find the best estimates and how it uses these SS and DF to calculate a mean square, or variance. Finally, taking the square root of the variance produces the SD, which is a measure of the variability in a sample that is conveniently on the same scale as the original measurements. However, by treating the 30 values as a single sample of plant heights we are ignoring the question of interest: do we in fact have two samples with different mean heights due to inbreeding?

2.3 Differences between groups

One useful approach for graphically describing and summarizing data in different treatments is the box-and-whisker plot, invented by the statistician John Tukey. A ggplot2 box plot of Darwin’s maize data produces Fig. 2.3:

• > qplot(type, height, data= darwin, geom= "boxplot",
• > main= "Darwin's Maize: Box-&-Whisker plot",
• > xlab= "Fertilization",
• > ylab= "Height (inches)")+theme_bw() Figure 2.3 A Tukey box-and-whisker plot of Darwin’s maize data showing the median heights (the horizontal lines within the boxes) and the 25th and 75th percentiles (the top and bottom of the boxes), with the whiskers distinguishing the main body of the data from outliers.

Box plots provide a rich summary of the data giving the medians (the horizontal lines within the boxes) and the 25th and 75th percentiles (the top and bottom of the box); the ends of the whiskers separate the main body of the data from outlying values. We mostly work with means, but when data (p.16) are skewed the median (the half-way point) may provide a more appropriate measure of central tendency. The graph is certainly suggestive of differences between groups, but box plots are descriptive and a different approach is required to make inferences about the treatment effects.

2.4 Data in long and wide formats

The darwin data frame is in long format where each data point is on a separate row. This is usually the data format we need for conducting linear model analysis in most statistical software. However, you may have the data in a spreadsheet in wide format—with the heights for the crossed and selfed plants in different columns—or need to organize them in that way. The reshape package is designed for rearranging data sets and making statistical summaries of them. At its core is the idea of melting data sets into molten form using the melt() function and then casting them into a new arrangement or a summarized form using cast(). First we create a molten (p.17) form of the data set that classifies the columns into the identity variables that define its structure and measure variables—in this case just our single response variable:

• > library(arm)
• > mDarwin <- melt(darwin, id.vars= c("pot", "pair", "type"),
• > measure.vars="height") # not shown

Then we can cast a wide version of the data set that specifies pot and pair to be the rows (given on the left-hand side of the tilde, ~) and forming a column of heights for each pollination type (as specified on the right-hand side of the tilde—note that because we only measured one response the column labelled ‘variable’ has only this one repeated entry all the way down):

• > darwide <- cast(mDarwin, pot + pair ~ variable + type)
 pot pair height_Cross height_Self 1 I 1 23.500 17.375 2 I 2 12.000 20.375 3 I 3 21.000 20.000 4 II 4 22.000 20.000 5 II 5 19.125 18.375 6 II 6 21.500 18.625

The head() function allows us to just output the first few rows of the reshaped data frame. By substituting darwide$height.cross and darwide$height.self into the functions given earlier for the mean and SD of the long version of the data we can see that the crossed plants have a height of 20.2 while the selfed plants have a lower mean height of 17.6, a shortfall of about 2.6 inches. The question is: how confident are we that this difference reflects negative effects of selfing? To judge this we have to assess the signal (the difference between treatment means) relative to the noise (the level of variation within the samples). ANOVA uses variance to quantify the signal and noise, but to get a first impression of the level of variability on the same scale as the means we can apply the sd() function to the wide data frame to get values for the SDs of the crossed and selfed samples of 3.6 and 2.1 inches, respectively (see Section 2.2).

(p.18) 2.5 A first linear model analysis: ANOVA

Although this book points out the distinction between ANOVA, analysis of covariance (ANCOVA), and linear regression one key point that I want to try and get across is that it is better to think of these as closely related forms of linear model analysis that are best handled in a general way. Luckily, R is set up to work exactly like this and we can perform all of these types of analysis using the linear model function, lm(). A good, if slightly unconventional, place to start is to use the lm() function to estimate the grand mean and total SS that were explained earlier in Section 2.2. We have already created R objects and assigned values to them—for example, we created an R data frame object called darwide and assigned it the values of the reordered darwin data frame. We are now going to use the lm() function to perform a linear model analysis and assign it to a linear model object that we will call ls0 (for least squares model 0—not to be confused with the R list function, ls()):

• > ls1 <- lm( height ~ 1, data= darwin)

The arguments of the lm() function specify that we want to analyse a response (height) as a function of (~) an explanatory variable. Notice the use of the 1 to specify that in this case we do not have an explanatory variable and simply want to estimate the grand mean. Also notice how the data argument saves us having to add the name of the data frame followed by a dollar symbol in front of every variable name. I recommend that you use the data argument whenever you have the opportunity unless there is a good reason not to.

The statisticians Andrew Gelman, Jennifer Hill, and their colleagues have written a handy display() function as part of their arm package to extract the key features of linear models for teaching purposes, so we’ll begin by using this:

• > library(arm)
• > display(ls1)
 coef.est coef.se (Intercept) 18.88 0.58 --- n = 30, k = 1 residual sd = 3.18, R-Squared = 0.00

(p.19) The headings to the two columns of numbers indicate that these are estimates of model coefficients with their standard errors (we shall explore standard errors later). The first row of this type of R output is always generically labelled as ‘(Intercept)’ and in this case the intercept is simply the overall grand mean that we have already calculated using the mean() function. The sample size, n, gives the number of data points (and rows in the data frame) and the number of parameters estimated—the grand mean—is given by k. We know that the SD is the square root of the variance so could 3.18 be the square root of the variance we estimated earlier using the var() function (see Section 2.2)?

• > ( 3.18^2 )
•  10.1124

Bar a bit of rounding error, yes. That was a useful exercise for demystifying what is going on but what we really want is a linear model that analyses a response—differences in plant height—as a function of a categorical explanatory variable, namely the type of pollination (with two categories, crossed and selfed). We can create an object for this analysis, called ls1:

• > ls1 <- lm(height~ 1 + type, data= darwin)> # or, equivalently: lm(height~ type…)

This linear model analysis makes a few assumptions that we will gradually explore later in this book. If the assumptions are badly contravened then the output of the analysis may not be entirely trustworthy. Before looking at the results of an analysis it is therefore a good idea to check that the assumptions have been satisfactorily met and consider the likely consequences of any substantial infringements. However, we will postpone (p.20) this until Chapter 3 since we have already seen that one of the assumptions may be contravened—the variability in the two treatments may be unequal, particularly due to the presence of two apparently outlying data points in the crossed treatment.

The model formula now contains type in addition to the intercept indicated by the 1. Actually, we could omit the 1 since R will automatically include it for us (if you compare this model with one that omits the 1 you will see the output is identical—try it). However, the intercept is no longer the grand mean as in model ls0, as we can see from the display() function output:

• > display(ls1)
 Coef.est coef.se (Intercept) 20.19 0.76 typeSelf -2.62 1.07 --- n = 30, k = 2 residual sd = 2.94, R-Squared = 0.18

Now there are two rows of numbers in the output, so what is the intercept? We can work out what the intercept is in the style of Sherlock Holmes—by eliminating all of the alternative possibilities bar one. The label of the second row, typeSelf, is produced by giving the name of the explanatory variable (type) followed, with no spacing, by ‘Self’—the name of one of the levels of this factor (a factor is a categorical explanatory variable). Since type has only two levels the intercept must be the other, and could be relabelled as typeCross (Box 2.4).

(p.21) So the coefficient in the row labelled intercept is the average height of the 15 maize plants in the crossed treatments. A common mistake is to think that the value in the second row is the height of the selfed plants. This would be a fair assumption, as the label typeSelf implies that is what it ought to be. However, in this case it should be obvious to anyone with their wits about them that this cannot be the mean height of the selfed plants since the value is negative! Instead, the output shows the mean value of one of the factor levels (by default whichever comes first numerically or alphabetically, see Box 2.5) and the difference between this value and the mean of the other treatment level. So in this case the intercept refers to the crossed plants and the second row is the difference in the mean height of the two groups. This may seem a bit odd at first sight but there are good reasons for doing it this way, and in this case it also focuses our attention on the question of interest: is there any difference in the heights of the plants depending on whether pollination was through selfing or outcrossing? Note that the output does not give the overall grand mean that we calculated earlier: our question does not involve the grand mean and estimating it would waste one of our precious DF. Instead, one of the treatment-level means is taken as a baseline for the calculations (in this case: type = Crossed). The second column gives the standard errors of the values in the first column. Since the first column of numbers contains a mean and a difference between means we now know that the second column gives a standard error of the mean (SEM) and a standard error of a difference between the two means (SED).

(p.22) One potentially confusing issue is the difference between SD and standard error. A standard error is simply the SD of a statistic—so the SEM is a SD of a mean. The SEM quantifies the variation we would expect to get if we repeatedly estimated the mean using different samples. Recall from earlier that the SD is the square root of the variance. The standard error is related to the variance in a similar but slightly more complex way

$Display mathematics$

or equivalently

$Display mathematics$

where n is the number of data points used to calculate the variance. What does n do in the calculation of the standard errors? For a given level of variability—as reflected in the variance and SD—dividing by the square root of the size of the sample will produce smaller standard errors for larger samples relative to smaller ones. This makes good intuitive sense: we have more confidence in the mean of a large sample than that of a small sample (just think about how much confidence you would have in predictions from polls of voting intentions based on small or large samples of people).

The SED between the means of two groups is

$Display mathematics$

where the subscript ‘a’ indicates values for the crossed plants and ‘b’ those for the selfed plants. Here, because sample sizes are equal, the SED is the square root of two (~1.4) times the SEM. This is a very handy relationship to remember, as it will enable us to convert between different types of error bars and confidence intervals, as we will see in Chapter 5.

The lower half of the display in Box 2.5 gives the number of parameters, k, as 2, which follows from the two treatment levels in our explanatory factor. It finishes by giving a ‘residual’ SD and the R squared. The R squared is the (p.23) proportion of the total sum of squares (SST) explained by the statistical model, in this case by the pollination treatment. Notice that the value of the residual SD (2.94) is somewhere between the SD values of the two treatments we calculated earlier (3.6 and 2.1); we’ll come back to this in a moment.

One limitation is that we don’t have the mean and standard error of the other treatment level. It is easy enough to calculate the mean by simple addition or subtraction but the standard error cannot be calculated in this way. Instead, we can get R to estimate the mean and standard error for the selfed treatment by using the relevel() function to override the default assignment of the crossed treatment as the intercept (Box 2.8).

• > darwin$type <- relevel(darwin$type, ref= “Self”)

If we then rerun the linear model and display commands we get the following changes in the output:

 Analysis of Variance Table Response: height Df Sum Sq Mean Sq F value Pr(>F) Type 1 51.352 51.352 5.9395 0.02141* Residuals 28 242.083 8.646 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• coef.est coef.se
• (Intercept) 17.58 0.76
• typecross 2.62 1.07

Now the selfed treatment level is taken as the intercept and we get its point estimate and standard error together with the same difference in means (with the opposite sign) and SED as before.

Notice that the SEM of the selfed treatment is the same as that for the crossed treatment. How can this be when earlier on we found the variances and SDs for the two treatment levels to be different? One key feature of (p.24) ANOVA is that it calculates a pooled estimate of the so-called residual variation within treatments. The logic is that by estimating the variance using all treatments we can get a better estimate of the noise than if we did the calculation for each group separately (the smaller the group the poorer the estimate). We can see how this works by looking at an ANOVA table. An ANOVA table summarizes the least squares calculations we worked through earlier and performs F-tests using the variances.

2.6 ANOVA tables

ANOVA tables can be produced in R for analyses conducted with the lm() function by using the anova() function:

• > anova(ls1)
 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20.1917 0.7592 26.596 <2e-16 typeSelf -2.6167 1.0737 -2.437 0.0214

First, notice that while the display output contained direct information about the biology (the average height of the selfed plants and the average difference in heights) the ANOVA table only contains information about the statistical analysis and should therefore be of less interest to us. The first column gives the source of the variation, in this case the pollination treatment or the unexplained variation that is left, i.e. the residual. The numbers in the next three columns—the DF, sums of squares, and the mean square (i.e. variance)—have already been introduced and briefly explained in relation to calculating the overall variation in the data. Unlike some statistical software packages, R does not give a final row for the total DF and SS; if we wanted to add these to our table we could simply take (p.25) them from the anova() function output for the ls0 model given earlier (it is easy to check this is correct since SS and DF are both additive—try it). We shall look at how we decompose the total SS into two parts in a moment, but first notice that there is a single residual mean square (variance) that is calculated by pooling the variation within both samples in order to get a better estimate of the noise than would be the case when calculating the variance separately for both groups based on the smaller sample size (as we did earlier). This is particularly valuable when we have a greater number of treatments with smaller sample sizes, and was one of the key ideas in the development of ANOVA. This pooled estimate obviously lies between the estimates of the variances calculated earlier for each treatment separately. ANOVA always uses the pooled estimate of the variance. For example, when calculating the standard error for both treatment levels we plug the pooled estimate (8.646) into the equation given above (p22, top) when calculating the standard errors for both (all) treatments. Because the treatment sample sizes are also the same (n = 15) both levels have the same SEM. In general, when using ANOVA—or any related linear model analysis that employs a pooled estimate of the residual variance—when the sample sizes per treatment level are the same then the SEMs will have the same value. Conversely, if the SEMs vary then the sample sizes must vary.

When introducing ANOVA I said that the general strategy was to decompose the overall variance that we have calculated into between- and within-group variances. How do we do that? The overall variation was quantified by calculating the sums of squared differences from each data point to the overall grand mean (Fig. 2.2, left). We have already seen that statistical software generally does not use the overall grand mean because this may not be of interest and would waste a precious DF. Instead, statistical software typically selects one of the treatment means as a baseline for the calculations (in this example R selects the crossed treatment level by default since it follows numerical and alphabetical order). Nevertheless, since it is usually seen as the most intuitive baseline, and since we have already used it, let’s continue in the same fashion using the grand mean. Another way to look at the general idea of the analysis is that it asks how much of the variability in the data (p.26) we can explain simply by categorizing the data points into the two treatment levels, crossed and selfed. To calculate the SS explained by the pollination treatment we essentially replace the data points with their expected or ‘fitted’ values—these are simply the treatment-level means (Fig. 2.2, centre). After making this change we then repeat the least squares process introduced above: measure the distance from the grand mean to each fitted value (i.e. the relevant treatment mean), square, and add up to produce the SS for our treatment (SSA; typically the first treatment is called A and so on). One key thing to remember about SS is that they are additive. Because we now have the total and one part of it, what is left over (the residual) can be worked out by subtraction. What is left is the variation within treatments; that is the variation that cannot be explained by the treatment-level means. To calculate it directly we repeat the least squares process one more time but now simply measuring the difference from each data point to the relevant treatment-level mean (Fig. 2.2, right). Summing these squared distances produces the residual SS, which is more commonly abbreviated to its alternative name, the error sum of squares (SSE).

We have already introduced the calculation of the mean square (variance) by dividing the SS by the DF, and we have learnt that the DF is one less than the number of values used for the calculation (DF = n – 1) in this case 30 – 1 = 29. In the ANOVA table, the total of 29 DF has been decomposed into a single DF for the pollination-type treatment and 28 (29 – 1) for the residual. Notice that when calculating the DF for the treatment we subtract one from the number of factor levels (not the number of data points), in this case 2 – 1 = 1. The residual variance from the ANOVA table is simply the square of the residual SD from the arm package display() function output given above.

2.7 Null hypothesis significance testing

Finally, we are ready to put Darwin’s hypothesis on inbreeding depression to the test. The ANOVA uses the differences between treatments and the variation within groups to calculate a signal-to-noise ratio: the larger this (p.27) variance ratio is, the more confident we can be that the effect of the treatment is real and not a false positive. The signal-to-noise ratio is calculated by dividing the treatment variance by the residual error variance to produce the F-value (originally called the variance ratio but later renamed in Fisher’s honour). In this case, the value of 5.9 means the estimated signal is nearly six times larger than the estimated noise. In a perfect world, if there were no signal then the treatment variance would be nothing more than another measure of the error variance. If we could estimate both the signal and noise perfectly then the variance ratio produced when there was absolutely (p.28) no effect of the treatment would be equal to one (the null expectation). In practice, since we are estimating the signal and noise from samples, the treatment variance will sometimes be estimated (or rather underestimated) to be smaller than the error variance, and occasionally we will get F-ratios less than one (‘negative variance components’). So long as we are confident there are no problems with our data and analysis then we can do little more than write these off as sampling error. Of course, the variance ratio could also be bigger than one simply due to sampling error, but the larger the signal is relative to the noise the more confident we can be that it is not a false positive. The probability value, P, quantifies this for us. The probability value of 0.02 says that there is a 2% chance of having data that would produce a signal-to-noise ratio of this size (or larger) if there was actually no effect of pollination type. Essentially there is a 2% chance of our result being a false positive.

The probability value in the ANOVA table is calculated from the F-distribution. The probability assigned to a given F-value depends on the DF of both the signal and the noise, which in turn depend on the sample size and complexity of the statistical model. We can see this using the R pf() function to re-create the P-value in the ANOVA table:

• > pf(5.9395, 1, 28, lower.tail= FALSE)
•  0.02141466

The last argument in the function is set to give us the probability of being in the tail of the distribution with an F-value equal to or greater than the observed value (5.9). Because of this it is important that when reporting results we don’t just give probability values but present them together with the F-value and its two sets of DF, something like this:

> The height of the cross-pollinated plants was significantly greater than that of the self-pollinated plants (F1,28 = 5.9; P = 0.02).

We could further improve on this presentation by saying something about the height of the plants in the different treatments and not just that the average height of one treatment was significantly bigger than the other (Box 2.8):

(p.29) The display() function from the arm package can be thought of as a teaching version of the summary() function. We can compare the differences in the output of these functions (notice the use of the options() function to turn off the significance highlighting stars based on the earlier discussion):

• > options(show.signif.stars= FALSE) # turn significance level > # stars off
• > summary(ls1)
 Df Sum Sq Mean Sq F value Pr(>F) Pair 14 86.264 6.162 0.5536 0.8597 Type 1 51.352 51.352 4.6139 0.0497 Residuals 14 155.820 11.130
• Residual standard error: 2.94 on 28 degrees of freedom
• Multiple R-squared: 0.175,Adjusted R-squared: 0.1455
• F-statistic: 5.94 on 1 and 28 DF, p-value: 0.02141

The estimates and standard errors are the same as in the display function but are shown to more decimal places by default. However, the summary function includes test statistics and probability values. The lower P-value of 0.0214 is the same as that produced by the F-test given above, even though it is attributed to a t-value produced by a t-test rather than the F-test used in our ANOVA! Clearly, the t-test and the F-test produce equivalent P-values, at least for a simple situation like this (we will see later that this is not always the case in more complex settings). As we will see in (p.30) Chapter 3, while the F-test works with the variance on the squared scale the t-test works with the estimate and SEs on the original scale of measurement. It would therefore seem to follow that F = t2; is this true?

• > (2.437^2)
•  5.938969

Bar a bit of rounding error, yes! So, for a simple one-way ANOVA like this the t-test presented in the second row of the summary() function output given above is equivalent to the F-test in the ANOVA table for asking whether there are significant differences between the two treatment-level means. We’ll postpone looking at the t-test in greater detail until Chapter 3. The residual standard error is the same as the residual SD given in the display() function output. The multiple R squared is the proportion of the variation explained by our model as quantified by the SS—in this case pollination type explains nearly 18% of the total variation in height as quantified by the SST. The adjusted R squared is a related measure that takes into account the complexity of the model—we’ll meet this again later together with information criteria. The last line of the summary() function output produces an F-test for the whole model. In this case, the model only contains the effects of pollination type so it simply repeats the result we have already seen.

If you look back to the start of this chapter you will see that the data set has additional information: in particular, equal numbers of crossed and selfed plants were planted in each pot. Essentially the plants are grouped into pairs, one from each treatment. In Chapter 3 we will analyse these data using the paired t-test, as Fisher did, and we can add the pairing to the ANOVA as follows:

• > anova(lm( height~ pair + type, data= darwin)

You can see that adding the pairing to the analysis takes up a lot of DF and it changes the calculation of the mean squares. We are not interested in the (p.31) pairing in itself, but notice how the pairs are less variable than we would expect based on the residual error (the mean square for pair is only around half of the residual error). Pair therefore has an F-value smaller than the expected value of 1 (indicating a ‘negative variance component’). Situations like this can arise due to sampling error but may also alert us to problems with the experimental design (see Box 2.9). As a consequence, in this case adding pairing unexpectedly lowers the significance of pollination (p.32) type compared with the unpaired analysis, but it remains just below the arbitrary 5% mark (if it rose further and lay between 5 and 10% we might refer to it as marginally significant). Notice that type is now tested with only 14 DF for the residual error which reduces the statistical power of the test—you can demonstrate this for yourself using the pf() function introduced earlier if you keep the F-value constant and vary the error DF. In this case, due to the negative variance component, the cost in terms of DF of adding the pairing to the model outweighs the potential benefits we would expect in general due to a reduced residual error.

2.8 Summary

Let us recap what we’ve done in this chapter. Our question, following Darwin, was whether inbreeding reduces fitness. In the experiment we use the height of maize plants as a surrogate for their fitness and ask whether it is reduced by self-pollination. The least square process estimates the lines of best fit—the means for each treatment—and quantifies the variation in the data using SS. The SST can be decomposed into the variation between treatments (the signal) and the residual variation within treatments (the noise). We get average amounts of variation by calculating the mean squares—or variances—by dividing the SS by the DF. The variance ratio (or F-value) quantifies the ratio of the signal to the noise where the null expectation is a value of 1. As values get larger than 1 the probability of this happening by chance (a false positive) gets smaller and we become more confident that we are seeing a real effect. In our case the F-value was nearly six and the probability of this being a false positive with the data set and statistical model used is less than 5%—the conventional point at which to declare a result statistically significant. Notice that this arbitrary level of statistical significance does not necessarily mean that the effect is biologically significant. Our enhanced presentation gave the mean heights of the two treatments and the difference between treatments, but the ANOVA approach put the emphasis on the P-value and on passing arbitrary levels of significance. In Chapter 3 we will explore the t-test further and see how (p.33) the standard errors that it is based on can be used to give an approach that focuses less on the P-value and level of significance and more on the estimated heights and differences together with their uncertainties. In this example the analysis seems to support the hypothesis that inbreeding is detrimental to fitness (although the result is close to the conventional boundary of significance and both treatments have some outlying values). (p.34)