Comparing Groups: Analysis of Variance
Abstract and Keywords
Analysis of variance is introduced using a worked analysis of some data collected by Charles Darwin on inbreeding depression in experimental maize plants. ANOVA is used to compare the height of a group of cross-pollinated plants with a group of self-fertilized seedlings. The least squares process is explained, including the calculation of sums of squares and variances and the calculation of F-ratios. The organization of the results of a linear model into an ANOVA table is explained. The R code necessary to perform the analysis is worked through and explained in terms of the underlying statistical concepts, and the interpretation of the R output is demonstrated.
Keywords: analysis of variance, least squares, linear model, ANOVA table, F-ratios, sums of squares, variance
The New Statistics with R. Andy Hector. © Andy Hector 2015.
Published 2015 by Oxford University Press.
2.1 Introduction
Inbreeding depression is an important issue in the conservation of species that have lost genetic diversity due to a decline in their populations as a result of overexploitation, habitat fragmentation, or other causes. We begin with some data on this topic collected by Charles Darwin. In The effects of cross and self fertilisation in the vegetable kingdom, published in 1876, Darwin describes how he produced seeds of maize (Zea mays) that were fertilized with pollen from the same individual or from a different plant. Pairs of seeds taken from self-fertilized and cross-pollinated plants were then germinated in pots and the height of the young seedlings measured as a surrogate for their evolutionary fitness. Darwin wanted to know whether inbreeding reduced the fitness of the selfed plants. Darwin asked his cousin Francis Galton—a polymath and early statistician famous for ‘regression to the mean’ (not to mention the silent dog whistle!)—for advice on the analysis. At that time, Galton could only lament that, ‘The determination of the variability . . . is a problem of more delicacy than that of determining the means, and I doubt, after making many trials whether it is possible to derive useful conclusions from these few observations. We ought to have measurements of at least fifty plants in each case’. Luckily we can now address this question using any one of several closely related linear model analyses. In this chapter we will use the analysis of variance (ANOVA) originally developed by Ronald Fisher (Box 2.1), and in Chapter 3 Student’s t-test.
The focus of this book is statistical analysis using R, not the R programming language itself (see the Appendix). R is therefore introduced relatively briefly, and if you are completely new to the language you will need to read the introductory material in the Appendix first and refer back to it as needed, together with the R help files. You should also explore the wealth of other information on R recommended there and available via the web.
In R we can get Darwin’s data (Box 2.2) from an add-on package called SMPracticals after installing it from the Comprehensive R Archive Network (CRAN) website (together with any other packages it depends on) and activating it using the library() function. Notice the use of the hash symbol to add comments to the R code to help guide others through your analysis and remind yourself what the R script does:
 > install.packages("SMPracticals", dependencies = TRUE) # install package from CRAN website
 > library(SMPracticals) # activates package for use in R
 > darwin # shows the data on screen
   pot pair  type height
1    I    1 Cross 23.500
2    I    1  Self 17.375
3    I    2 Cross 12.000
4    I    2  Self 20.375
5    I    3 Cross 21.000
6    I    3  Self 20.000
7   II    4 Cross 22.000
8   II    4  Self 20.000
9   II    5 Cross 19.125
10  II    5  Self 18.375
11  II    6 Cross 21.500
12  II    6  Self 18.625
13 III    7 Cross 22.125
14 III    7  Self 18.625
15 III    8 Cross 20.375
16 III    8  Self 15.250
17 III    9 Cross 18.250
18 III    9  Self 16.500
19 III   10 Cross 21.625
20 III   10  Self 18.000
21 III   11 Cross 23.250
22 III   11  Self 16.250
23  IV   12 Cross 21.000
24  IV   12  Self 18.000
25  IV   13 Cross 22.125
26  IV   13  Self 12.750
27  IV   14 Cross 23.000
28  IV   14  Self 15.500
29  IV   15 Cross 12.000
30  IV   15  Self 18.000
A good place to start is usually by plotting the data in a way that makes sense in terms of our question—in this case by plotting the data divided into the crossed and selfed groups (Fig. 2.1). R has some graphical functions that come as part of the packages that are automatically installed along with the so-called base R installation when you download it from the CRAN website. However, I am going to take the opportunity to also introduce Hadley Wickham’s ggplot2 (Grammar of Graphics, version 2) package that is widely used throughout this book. While ggplot2 has an all-singing, all-dancing ggplot() function, it also contains a handy qplot() function for quickly producing relatively simple plots (and which will take you a surprisingly long way). One advantage of this qplot() function is that its syntax is very similar to that of the base R graphics functions and other widely used R graphics packages such as Deepayan Sarkar’s lattice. Luckily, ggplot2 is supported by a comprehensive website and book so it is easy to expand on the brief introduction and explanations given here. If you do not have the ggplot2 package on your computer you can get it by rerunning the install.packages() function given earlier but substituting "ggplot2" in place of "SMPracticals". Notice that the qplot() function has a data argument, and one restriction when using ggplot2 is that everything we want to use for the plot must be in a single data frame (in this case everything we need is present in the darwin data frame, but if it were not we would have to create a new data frame that contained everything we want to use for our graphic):
 > library(ggplot2) # activate package
 > qplot(type, height, data = darwin, shape = type, colour = type) + theme_bw()
 > ggsave(file= "Hector_Fig21.pdf") # save graph as pdf
Some of the advantages of ggplot2 over the base graphic functions are immediately obvious in how simple it is to use different symbol shapes, colours, and backgrounds (deleting the theme_bw() command for the black and white background will reveal the default theme_grey() setting that is handy when using colours like yellow that do not show up well against white) together with an automatically generated key with a legend. Even better, notice how easy it is to save the file (in various types, sizes, or rescalings) using the handy ggsave() function.
Figure 2.1 suggests that the mean height may be greater for the crossed plants, which would be consistent with a negative effect of inbreeding. But how confident can we be in this apparent signal of inbreeding depression given the level of noise created by the variation within groups? The variability seems reasonably similar in the two groups except that the crossed group has a low value lying apart from the others with a height of 12—a potential outlier. Actually, as we will see in Fig. 2.2, there are two values plotted on top of one another at this point (see the ggplot2 website and online supplementary R script at <http://www.plantecol.org/> for the use of geom = "jitter" as an additional argument to the qplot() function that deals with this issue). This is typical of many data. The outlying low heights could be due to attack by a pest or disease, or because somebody dropped the pot, accidentally damaged the plant, or simply took the measurement incorrectly. It is hard to say anything more specific using this eyeball test since the difference between treatment groups is not that dramatic and there is a reasonable degree of variability within groups, not to mention those troublesome outliers. The ANOVA will quantify this signal-to-noise ratio and how confident we can be that we are not being fooled by a false positive: an apparent difference where one does not really exist (a ‘type I error’). The larger the difference relative to the variability, the more confident we can be.
2.2 Least squares
The general strategy of ANOVA is to quantify the overall variability in the data set, divide it into the variability between and within the self- and cross-pollinated groups, and calculate a signal-to-noise ratio. The greater the ratio of the signal to the noise, the more confident we can be that we have detected a real effect. To quantify the variability we can use a method called least squares that begins by measuring the differences from the individual data points to some reference point. The most intuitive one is the overall grand mean, but any value will do and, as we will see later, statistical software usually chooses a different benchmark. The method is probably better illustrated in the context of regression (see Chapter 4) but the name conveys the idea that the analysis finds the lines of best fit, in this case the treatment-level means, by minimizing the overall squared distances to the individual data points. Of course, we cannot simply add up the differences from each data point to the mean and use this as a measure of variability because the negative differences will always cancel the positive ones and consistently result in a value of zero! We could take the absolute values, but early statisticians instead solved this problem by squaring the differences so that the negative values become positive. The resulting value—the sum of the squared differences, or sum of squares (SS) for short—is our measure of the overall variability in the data.
Figure 2.2 plots the heights of the plants together with the overall mean and the difference between each value and this average. To save space, the R code for this figure is presented in the online supplementary R script at <http://www.plantecol.org/>. However, as indicated by its name, ANOVA works with another quantity, the variance (usually symbolized as sigma squared, σ²). The variance is also called the mean square (MS) because it is an average amount of variation: it might be useful to think of it loosely as a per-data-point average amount of variation. I say ‘loosely’ because the SS is actually averaged using a mysterious quantity called the degrees of freedom (DF; Box 2.3). The total number of DF is one less than the number of data points (30 − 1 = 29).
However, there’s a catch: because of the squaring necessary to calculate the SS this estimate of the variability is not on the same scale as the original data. Instead, we have a measure of variation on the squared scale: that is, not inches but square inches! So, if we used the SS to compare the variability in Darwin’s sample with the mean value we’d be comparing average height in inches with the variability in square inches—something we would associate more naturally with an area than a height. Luckily, this problem is easily solved by reversing the earlier squaring: by taking the square root of the variance we get the standard deviation (SD, s, or sigma, σ) which is a measure of variability on the same scale as the original data and which can be conveniently compared with the mean (or some other measure of central tendency). R has functions for calculating the mean and the SD (notice in the code the use of the dollar sign to separate the name of the data frame from the column of interest, and that putting the command inside parentheses causes the result to be printed):
 > mean(darwin$height)
 [1] 18.88333
 > sd(darwin$height)
 [1] 3.180953
The SD is a descriptive statistic that quantifies the variability in the heights of the 30 plants and which is on the same scale as the original measurements and the mean. It can be thought of as roughly the average distance between the mean and a data point. We will look at it again in Chapter 5. R also has functions for the variance and for taking square roots that we can use to check the relationship between the SD and the variance:
 > var(darwin$height)
 [1] 10.11846
 > sqrt( var(darwin$height))
 [1] 3.180953
To recap, we have seen how the least squares process minimizes the SS to find the best estimates and how it uses these SS and DF to calculate a mean square, or variance. Finally, taking the square root of the variance produces the SD, which is a measure of the variability in a sample that is conveniently on the same scale as the original measurements. However, by treating the 30 values as a single sample of plant heights we are ignoring the question of interest: do we in fact have two samples with different mean heights due to inbreeding?
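The recap above can be verified from first principles in base R. The sketch below hard-codes the 30 heights from the darwin data frame so that it is self-contained; the object names (heights, ss, and so on) are mine, not from the book:

```r
# The 30 plant heights (inches) from Darwin's data, typed in directly
heights <- c(23.500, 17.375, 12.000, 20.375, 21.000, 20.000,
             22.000, 20.000, 19.125, 18.375, 21.500, 18.625,
             22.125, 18.625, 20.375, 15.250, 18.250, 16.500,
             21.625, 18.000, 23.250, 16.250, 21.000, 18.000,
             22.125, 12.750, 23.000, 15.500, 12.000, 18.000)

grand.mean <- mean(heights)          # 18.88333
ss <- sum((heights - grand.mean)^2)  # sum of squared differences from the mean
df <- length(heights) - 1            # degrees of freedom: 30 - 1 = 29
ms <- ss / df                        # mean square (variance): 10.11846
s  <- sqrt(ms)                       # standard deviation: 3.180953

# These hand calculations match R's built-in functions
all.equal(ms, var(heights))          # TRUE
all.equal(s, sd(heights))            # TRUE
```

Working through the arithmetic once by hand like this makes it clear that var() and sd() are nothing more than the least squares quantities described above.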
2.3 Differences between groups
One useful approach for graphically describing and summarizing data in different treatments is the box-and-whisker plot, invented by the statistician John Tukey. A ggplot2 box plot of Darwin’s maize data produces Fig. 2.3:
 > qplot(type, height, data = darwin, geom = "boxplot",
 +     main = "Darwin's Maize: Box & Whisker plot",
 +     xlab = "Fertilization",
 +     ylab = "Height (inches)") + theme_bw()
Box plots provide a rich summary of the data, giving the medians (the horizontal lines within the boxes) and the 25th and 75th percentiles (the top and bottom of the box); the ends of the whiskers separate the main body of the data from outlying values. We mostly work with means, but when data are skewed the median (the halfway point) may provide a more appropriate measure of central tendency. The graph is certainly suggestive of differences between groups, but box plots are descriptive and a different approach is required to make inferences about the treatment effects.
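The summary quantities drawn by a box plot can also be computed directly with the quantile() function; a small sketch for the selfed plants, with the 15 heights hard-coded from the data table (note that box plot implementations can differ slightly in exactly how they compute the hinges):

```r
# Heights (inches) of the 15 self-fertilized plants from Darwin's data
self <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625,
          15.250, 16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)

# Median and the 25th and 75th percentiles: the line and box of the box plot
quantile(self, c(0.25, 0.50, 0.75))
#    25%    50%    75%
# 16.375 18.000 18.625
```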
2.4 Data in long and wide formats
The darwin data frame is in long format, where each data point is on a separate row. This is usually the data format we need for conducting linear model analysis in most statistical software. However, you may have the data in a spreadsheet in wide format—with the heights for the crossed and selfed plants in different columns—or need to organize them in that way. The reshape package is designed for rearranging data sets and making statistical summaries of them. At its core is the idea of melting data sets into molten form using the melt() function and then casting them into a new arrangement or a summarized form using cast(). First we create a molten form of the data set that classifies the columns into the identity variables that define its structure and the measure variables—in this case just our single response variable:
 > library(reshape) # activates package for use in R
 > mDarwin <- melt(darwin, id.vars = c("pot", "pair", "type"),
 +     measure.vars = "height") # not shown
Then we can cast a wide version of the data set that specifies pot and pair to be the rows (given on the left-hand side of the tilde, ~) and forms a column of heights for each pollination type (as specified on the right-hand side of the tilde—note that because we only measured one response the column labelled ‘variable’ has only this one repeated entry all the way down):
 > darwide <- cast(mDarwin, pot + pair ~ variable + type)
 > head(darwide)
  pot pair height_Cross height_Self
1   I    1       23.500      17.375
2   I    2       12.000      20.375
3   I    3       21.000      20.000
4  II    4       22.000      20.000
5  II    5       19.125      18.375
6  II    6       21.500      18.625
The head() function allows us to output just the first few rows of the reshaped data frame. By substituting darwide$height_Cross and darwide$height_Self into the functions given earlier for the mean and SD of the long version of the data we can see that the crossed plants have a mean height of 20.2 while the selfed plants have a lower mean height of 17.6, a shortfall of about 2.6 inches. The question is: how confident are we that this difference reflects negative effects of selfing? To judge this we have to assess the signal (the difference between treatment means) relative to the noise (the level of variation within the samples). ANOVA uses variances to quantify the signal and noise, but to get a first impression of the level of variability on the same scale as the means we can apply the sd() function to the wide data frame to get values for the SDs of the crossed and selfed samples of 3.6 and 2.1 inches, respectively (see Section 2.2).
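If the reshape package is not to hand, the same summary statistics can be computed directly in base R; a minimal sketch, with the two groups of heights hard-coded from the data table (the object names cross and self are mine):

```r
# Heights (inches) of the 15 cross-pollinated and 15 self-fertilized plants
cross <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125,
           20.375, 18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000)
self  <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625,
           15.250, 16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)

mean(cross)               # 20.19167
mean(self)                # 17.575
mean(cross) - mean(self)  # the signal: about 2.6 inches
sd(cross)                 # about 3.6: the noise within the crossed group
sd(self)                  # about 2.1: the noise within the selfed group
```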
2.5 A first linear model analysis: ANOVA
Although this book points out the distinction between ANOVA, analysis of covariance (ANCOVA), and linear regression, one key point that I want to try and get across is that it is better to think of these as closely related forms of linear model analysis that are best handled in a general way. Luckily, R is set up to work exactly like this and we can perform all of these types of analysis using the linear model function, lm(). A good, if slightly unconventional, place to start is to use the lm() function to estimate the grand mean and total SS that were explained earlier in Section 2.2. We have already created R objects and assigned values to them—for example, we created an R data frame object called darwide and assigned it the values of the reshaped darwin data frame. We are now going to use the lm() function to perform a linear model analysis and assign it to a linear model object that we will call ls0 (for least squares model 0—not to be confused with the R list function, ls()):
 > ls0 <- lm(height ~ 1, data = darwin)
The arguments of the lm() function specify that we want to analyse a response (height) as a function of (~) an explanatory variable. Notice the use of the 1 to specify that in this case we do not have an explanatory variable and simply want to estimate the grand mean. Also notice how the data argument saves us having to add the name of the data frame followed by a dollar symbol in front of every variable name. I recommend that you use the data argument whenever you have the opportunity unless there is a good reason not to.
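We can check that this intercept-only model does nothing more than estimate the grand mean; a tiny self-contained sketch with the heights typed in (the object name heights is mine):

```r
# The 30 plant heights (inches) from Darwin's data
heights <- c(23.500, 17.375, 12.000, 20.375, 21.000, 20.000,
             22.000, 20.000, 19.125, 18.375, 21.500, 18.625,
             22.125, 18.625, 20.375, 15.250, 18.250, 16.500,
             21.625, 18.000, 23.250, 16.250, 21.000, 18.000,
             22.125, 12.750, 23.000, 15.500, 12.000, 18.000)

# An intercept-only linear model: its single coefficient is the grand mean
coef(lm(heights ~ 1))  # (Intercept) = 18.88333
```

Fitting a mean with lm() may look odd, but it emphasizes that estimating a mean is already a (trivially simple) linear model.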
The statisticians Andrew Gelman, Jennifer Hill, and their colleagues have written a handy display() function as part of their arm package to extract the key features of linear models for teaching purposes, so we’ll begin by using this:
 > library(arm)
 > display(ls0)
            coef.est coef.se
(Intercept) 18.88    0.58
---
n = 30, k = 1
residual sd = 3.18, R-Squared = 0.00
The headings to the two columns of numbers indicate that these are estimates of model coefficients with their standard errors (we shall explore standard errors later). The first row of this type of R output is always generically labelled as ‘(Intercept)’ and in this case the intercept is simply the overall grand mean that we have already calculated using the mean() function. The sample size, n, gives the number of data points (and rows in the data frame) and the number of parameters estimated—the grand mean—is given by k. We know that the SD is the square root of the variance so could 3.18 be the square root of the variance we estimated earlier using the var() function (see Section 2.2)?
 > ( 3.18^2 )
 [1] 10.1124
Bar a bit of rounding error, yes. That was a useful exercise for demystifying what is going on but what we really want is a linear model that analyses a response—differences in plant height—as a function of a categorical explanatory variable, namely the type of pollination (with two categories, crossed and selfed). We can create an object for this analysis, called ls1:
 > ls1 <- lm(height ~ 1 + type, data = darwin) # or, equivalently: lm(height ~ type, …)
This linear model analysis makes a few assumptions that we will gradually explore later in this book. If the assumptions are badly contravened then the output of the analysis may not be entirely trustworthy. Before looking at the results of an analysis it is therefore a good idea to check that the assumptions have been satisfactorily met and consider the likely consequences of any substantial infringements. However, we will postpone this until Chapter 3 since we have already seen that one of the assumptions may be contravened—the variability in the two treatments may be unequal, particularly due to the presence of two apparently outlying data points in the crossed treatment.
The model formula now contains type in addition to the intercept indicated by the 1. Actually, we could omit the 1 since R will automatically include it for us (if you compare this model with one that omits the 1 you will see the output is identical—try it). However, the intercept is no longer the grand mean as in model ls0, as we can see from the display() function output:
 > display(ls1)
            coef.est coef.se
(Intercept) 20.19    0.76
typeSelf    -2.62    1.07
---
n = 30, k = 2
residual sd = 2.94, R-Squared = 0.18
Now there are two rows of numbers in the output, so what is the intercept? We can work out what the intercept is in the style of Sherlock Holmes—by eliminating all of the alternative possibilities bar one. The label of the second row, typeSelf, is produced by giving the name of the explanatory variable (type) followed, with no spacing, by ‘Self’—the name of one of the levels of this factor (a factor is a categorical explanatory variable). Since type has only two levels the intercept must be the other, and could be relabelled as typeCross (Box 2.4).
So the coefficient in the row labelled (Intercept) is the average height of the 15 maize plants in the crossed treatment. A common mistake is to think that the value in the second row is the height of the selfed plants. This would be a fair assumption, as the label typeSelf implies that is what it ought to be. However, in this case it should be obvious to anyone with their wits about them that this cannot be the mean height of the selfed plants since the value is negative! Instead, the output shows the mean value of one of the factor levels (by default whichever comes first numerically or alphabetically; see Box 2.5) and the difference between this value and the mean of the other treatment level. So in this case the intercept refers to the crossed plants and the second row is the difference in the mean height of the two groups. This may seem a bit odd at first sight but there are good reasons for doing it this way, and in this case it also focuses our attention on the question of interest: is there any difference in the heights of the plants depending on whether pollination was through selfing or outcrossing? Note that the output does not give the overall grand mean that we calculated earlier: our question does not involve the grand mean and estimating it would waste one of our precious DF. Instead, one of the treatment-level means is taken as a baseline for the calculations (in this case type = Cross). The second column gives the standard errors of the values in the first column. Since the first column of numbers contains a mean and a difference between means, we now know that the second column gives a standard error of the mean (SEM) and a standard error of a difference between the two means (SED).
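To demystify this parameterization, we can refit the model to a hand-typed copy of the data and relate the coefficients back to the group means; darwin2 is my name for this reconstructed data frame, with the heights taken from the data table given earlier:

```r
# Reconstruct a minimal version of Darwin's data (rows alternate Cross, Self)
darwin2 <- data.frame(
  type = rep(c("Cross", "Self"), times = 15),
  height = c(23.500, 17.375, 12.000, 20.375, 21.000, 20.000,
             22.000, 20.000, 19.125, 18.375, 21.500, 18.625,
             22.125, 18.625, 20.375, 15.250, 18.250, 16.500,
             21.625, 18.000, 23.250, 16.250, 21.000, 18.000,
             22.125, 12.750, 23.000, 15.500, 12.000, 18.000))

coefs <- coef(lm(height ~ type, data = darwin2))
coefs["(Intercept)"]  # 20.19: the mean of the baseline (crossed) level
coefs["typeSelf"]     # -2.62: a difference between means, not a mean

# The selfed mean is recovered by simple addition
coefs["(Intercept)"] + coefs["typeSelf"]  # 17.575
```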
One potentially confusing issue is the difference between the SD and the standard error. A standard error is simply the SD of a statistic—so the SEM is the SD of a mean. The SEM quantifies the variation we would expect to get if we repeatedly estimated the mean using different samples. Recall from earlier that the SD is the square root of the variance. The standard error is related to the variance in a similar but slightly more complex way:

SEM = √(variance / n)

or equivalently

SEM = SD / √n

where n is the number of data points used to calculate the variance. What does n do in the calculation of the standard errors? For a given level of variability—as reflected in the variance and SD—dividing by the square root of the size of the sample will produce smaller standard errors for larger samples relative to smaller ones. This makes good intuitive sense: we have more confidence in the mean of a large sample than in that of a small sample (just think about how much confidence you would have in predictions from polls of voting intentions based on small or large samples of people).
The SED between the means of two groups is:

SED = √(variance_a / n_a + variance_b / n_b)

where the subscript ‘a’ indicates values for the crossed plants and ‘b’ those for the selfed plants. Here, because the sample sizes are equal (and a single pooled variance is used, as explained below), the SED is the square root of two (~1.4) times the SEM. This is a very handy relationship to remember, as it will enable us to convert between different types of error bars and confidence intervals, as we will see in Chapter 5.
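Plugging in the numbers from this analysis reproduces the standard errors in the display() output; a quick sketch using the residual SD of 2.94 shown above (its square is the pooled residual variance) and n = 15 plants per treatment level:

```r
residual.sd <- 2.94  # pooled residual SD from the display() output
n <- 15              # plants per treatment level

sem <- residual.sd / sqrt(n)                      # about 0.76: the coef.se of the intercept
sed <- sqrt(residual.sd^2 / n + residual.sd^2 / n)  # about 1.07: the coef.se of the difference
sed / sem                                         # exactly sqrt(2), about 1.414
```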
The lower half of the display in Box 2.5 gives the number of parameters, k, as 2, which follows from the two treatment levels in our explanatory factor. It finishes by giving a ‘residual’ SD and the R squared. The R squared is the proportion of the total sum of squares (SST) explained by the statistical model, in this case by the pollination treatment. Notice that the value of the residual SD (2.94) is somewhere between the SD values of the two treatments we calculated earlier (3.6 and 2.1); we’ll come back to this in a moment.
One limitation is that we don’t have the mean and standard error of the other treatment level. It is easy enough to calculate the mean by simple addition or subtraction but the standard error cannot be calculated in this way. Instead, we can get R to estimate the mean and standard error for the selfed treatment by using the relevel() function to override the default assignment of the crossed treatment as the intercept (Box 2.8).
 > darwin$type <- relevel(darwin$type, ref = "Self")
If we then rerun the linear model and display commands we get the following changes in the output:
            coef.est coef.se
(Intercept) 17.58    0.76
typeCross   2.62     1.07
---
n = 30, k = 2
residual sd = 2.94, R-Squared = 0.18
Now the selfed treatment level is taken as the intercept and we get its point estimate and standard error together with the same difference in means (with the opposite sign) and SED as before.
Notice that the SEM of the selfed treatment is the same as that for the crossed treatment. How can this be when earlier on we found the variances and SDs for the two treatment levels to be different? One key feature of ANOVA is that it calculates a pooled estimate of the so-called residual variation within treatments. The logic is that by estimating the variance using all treatments we can get a better estimate of the noise than if we did the calculation for each group separately (the smaller the group, the poorer the estimate). We can see how this works by looking at an ANOVA table. An ANOVA table summarizes the least squares calculations we worked through earlier and performs F-tests using the variances.
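The pooling can be verified by hand: sum the squared deviations within each group and divide by the combined residual degrees of freedom. A sketch with the heights hard-coded from the data table (object names are mine):

```r
# Heights (inches) of the crossed and selfed plants
cross <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125,
           20.375, 18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000)
self  <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625,
           15.250, 16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)

# Within-group sums of squares, pooled over both groups
ss.within <- sum((cross - mean(cross))^2) + sum((self - mean(self))^2)
pooled.var <- ss.within / (15 + 15 - 2)  # about 8.646: the residual mean square

# The pooled estimate lies between the two separate group variances
var(cross)  # about 13.1
var(self)   # about 4.2
```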
2.6 ANOVA tables
ANOVA tables can be produced in R for analyses conducted with the lm() function by using the anova() function:
 > anova(ls1)
Analysis of Variance Table

Response: height
          Df  Sum Sq Mean Sq F value  Pr(>F)
type       1  51.352  51.352  5.9395 0.02141 *
Residuals 28 242.083   8.646
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
First, notice that while the display output contained direct information about the biology (the average height of the selfed plants and the average difference in heights) the ANOVA table only contains information about the statistical analysis and should therefore be of less interest to us. The first column gives the source of the variation, in this case the pollination treatment or the unexplained variation that is left, i.e. the residual. The numbers in the next three columns—the DF, sums of squares, and the mean square (i.e. variance)—have already been introduced and briefly explained in relation to calculating the overall variation in the data. Unlike some statistical software packages, R does not give a final row for the total DF and SS; if we wanted to add these to our table we could simply take them from the anova() function output for the ls0 model given earlier (it is easy to check this is correct since SS and DF are both additive—try it). We shall look at how we decompose the total SS into two parts in a moment, but first notice that there is a single residual mean square (variance) that is calculated by pooling the variation within both samples in order to get a better estimate of the noise than would be the case when calculating the variance separately for each group based on the smaller sample size (as we did earlier). This is particularly valuable when we have a greater number of treatments with smaller sample sizes, and was one of the key ideas in the development of ANOVA. This pooled estimate obviously lies between the estimates of the variances calculated earlier for each treatment separately. ANOVA always uses the pooled estimate of the variance: for example, we plug the pooled estimate (8.646) into the SEM equation given earlier when calculating the standard errors for both (all) treatments.
Because the treatment sample sizes are also the same (n = 15) both levels have the same SEM. In general, when using ANOVA—or any related linear model analysis that employs a pooled estimate of the residual variance—when the sample sizes per treatment level are the same then the SEMs will have the same value. Conversely, if the SEMs vary then the sample sizes must vary.
When introducing ANOVA I said that the general strategy was to decompose the overall variance that we have calculated into between- and within-group variances. How do we do that? The overall variation was quantified by calculating the sums of squared differences from each data point to the overall grand mean (Fig. 2.2, left). We have already seen that statistical software generally does not use the overall grand mean because this may not be of interest and would waste a precious DF. Instead, statistical software typically selects one of the treatment means as a baseline for the calculations (in this example R selects the crossed treatment level by default since it comes first in numerical and alphabetical order). Nevertheless, since it is usually seen as the most intuitive baseline, and since we have already used it, let’s continue in the same fashion using the grand mean. Another way to look at the general idea of the analysis is that it asks how much of the variability in the data we can explain simply by categorizing the data points into the two treatment levels, crossed and selfed. To calculate the SS explained by the pollination treatment we essentially replace the data points with their expected or ‘fitted’ values—these are simply the treatment-level means (Fig. 2.2, centre). After making this change we then repeat the least squares process introduced above: measure the distance from the grand mean to each fitted value (i.e. the relevant treatment mean), square, and add up to produce the SS for our treatment (SSA; typically the first treatment is called A and so on). One key thing to remember about SS is that they are additive. Because we now have the total and one part of it, what is left over (the residual) can be worked out by subtraction. What is left is the variation within treatments; that is, the variation that cannot be explained by the treatment-level means.
To calculate it directly we repeat the least squares process one more time, but now simply measuring the difference from each data point to the relevant treatment-level mean (Fig. 2.2, right). Summing these squared distances produces the residual SS, which is more commonly referred to by its alternative name, the error sum of squares (SSE).
We have already introduced the calculation of the mean square (variance) by dividing the SS by the DF, and we have learnt that the total DF is one less than the number of values used for the calculation (DF = n − 1), in this case 30 − 1 = 29. In the ANOVA table, the total of 29 DF has been decomposed into a single DF for the pollination-type treatment and 28 (29 − 1) for the residual. Notice that when calculating the DF for the treatment we subtract one from the number of factor levels (not the number of data points), in this case 2 − 1 = 1. The residual variance from the ANOVA table is simply the square of the residual SD from the arm package display() function output given above.
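The whole decomposition can be reproduced numerically from the raw heights; the following self-contained sketch (my object names) rebuilds the SS, mean squares, and F value of the ANOVA table:

```r
# Heights (inches) of the crossed and selfed plants
cross <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125,
           20.375, 18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000)
self  <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625,
           15.250, 16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)
heights <- c(cross, self)
grand.mean <- mean(heights)

# Total SS: squared distances from each point to the grand mean
sst <- sum((heights - grand.mean)^2)            # about 293.4

# Treatment SS: replace each point by its fitted value (its treatment mean)
fitted <- c(rep(mean(cross), 15), rep(mean(self), 15))
ssa <- sum((fitted - grand.mean)^2)             # 51.352

# Residual (error) SS: distances from each point to its treatment mean
sse <- sum((heights - fitted)^2)                # 242.083
all.equal(sst, ssa + sse)                       # SS are additive: TRUE

# Mean squares (SS/DF) and the F value (signal-to-noise ratio)
f <- (ssa / 1) / (sse / 28)                     # 5.9395
```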
2.7 Null hypothesis significance testing
Finally, we are ready to put Darwin's hypothesis on inbreeding depression to the test. The ANOVA uses the differences between treatments and the variation within groups to calculate a signal-to-noise ratio: the larger this (p.27) variance ratio is, the more confident we can be that the effect of the treatment is real and not a false positive. The signal-to-noise ratio is calculated by dividing the treatment variance by the residual error variance to produce the F-value (originally called the variance ratio, but later renamed in Fisher's honour). In this case, the value of 5.9 means the estimated signal is nearly six times larger than the estimated noise. In a perfect world, if there were no signal then the treatment variance would be nothing more than another measure of the error variance. If we could estimate both the signal and the noise perfectly, then the variance ratio produced when there was absolutely (p.28) no effect of the treatment would be equal to one (the null expectation). In practice, since we estimate the signal and the noise from samples, the treatment variance will sometimes be underestimated, coming out smaller than the error variance, and occasionally we will get F-ratios less than one ('negative variance components'). So long as we are confident there are no problems with our data and analysis, we can do little more than write these off as sampling error. Of course, the variance ratio could also be bigger than one simply due to sampling error, but the larger the signal is relative to the noise, the more confident we can be that it is not a false positive. The probability value, P, quantifies this for us. The probability value of 0.02 says that there is a 2% chance of obtaining data that would produce a signal-to-noise ratio of this size (or larger) if there were actually no effect of pollination type. Essentially, there is a 2% chance of our result being a false positive.
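To see why the null expectation of the variance ratio is (approximately) one, and why F-ratios below one arise from sampling error alone, we can simulate data with no treatment effect at all. This is an illustrative sketch, not an analysis from the book:

```r
# Simulate the variance ratio when there is truly no treatment effect.
set.seed(1)  # arbitrary seed, for reproducibility
g <- factor(rep(c("cross", "self"), each = 15))  # dummy two-level factor
null.F <- replicate(5000, {
  y <- rnorm(30)                    # 30 observations of pure noise
  anova(lm(y ~ g))[1, "F value"]    # variance ratio for the dummy factor
})
mean(null.F)      # close to 1 (the exact expectation is 28/26)
mean(null.F < 1)  # roughly two-thirds of the null F-ratios fall below 1
```

Large F-values are rare under this null distribution, which is exactly what the P-value in the ANOVA table quantifies.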
The probability value in the ANOVA table is calculated from the F-distribution. The probability assigned to a given F-value depends on the DF of both the signal and the noise, which in turn depend on the sample size and the complexity of the statistical model. We can see this by using the R pf() function to recreate the P-value in the ANOVA table:
 > pf(5.9395, 1, 28, lower.tail= FALSE)
 [1] 0.02141466
The last argument in the function is set to give us the probability of being in the tail of the distribution with an F-value equal to or greater than the observed value (5.9). Because of this, when reporting results it is important not to give probability values alone but to present them together with the F-value and its two sets of DF, something like this:
> The height of the cross-pollinated plants was significantly greater than that of the self-pollinated plants (F_{1,28} = 5.9; P = 0.02).
We could further improve on this presentation by saying something about the height of the plants in the different treatments and not just that the average height of one treatment was significantly bigger than the other (Box 2.8).
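For example, alongside the test statistics we can report the treatment-level means and their difference. The snippet below uses made-up numbers for illustration; with the real data you would apply the same calls to the darwin data frame:

```r
# Treatment-level means and their difference (illustrative data,
# not Darwin's measurements).
height <- c(20, 22, 19, 21, 17, 18, 16, 17)
type   <- factor(rep(c("cross", "self"), each = 4))

means <- tapply(height, type, mean)     # mean height per pollination type
means                                   # cross 20.5, self 17.0
unname(means["cross"] - means["self"])  # difference between the means: 3.5
```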
(p.29) The display() function from the arm package can be thought of as a teaching version of the summary() function. We can compare the output of the two functions (notice the use of the options() function to turn off the significance-highlighting stars, following the earlier discussion):
 > options(show.signif.stars = FALSE) # turn significance stars off
 > summary(ls1)
 Coefficients:
             Estimate Std. Error t value Pr(>|t|)
 (Intercept)  20.1917     0.7592  26.596   <2e-16
 typeself     -2.6167     1.0737  -2.437   0.0214

 Residual standard error: 2.94 on 28 degrees of freedom
 Multiple R-squared: 0.175,  Adjusted R-squared: 0.1455
 F-statistic: 5.94 on 1 and 28 DF,  p-value: 0.02141
The estimates and standard errors are the same as in the display() function output but are shown to more decimal places by default. However, the summary() function also includes test statistics and probability values. The P-value of 0.0214 in the lower row of the coefficients table is the same as that produced by the F-test given above, even though it is attributed to a t-value produced by a t-test rather than the F-test used in our ANOVA! Clearly, the t-test and the F-test produce equivalent P-values, at least for a simple situation like this (we will see later that this is not always the case in more complex settings). As we will see in (p.30) Chapter 3, while the F-test works with the variances on the squared scale, the t-test works with the estimates and SEs on the original scale of measurement. It would therefore seem to follow that F = t^{2}; is this true?
 > (2.437^2)
 [1] 5.938969
Bar a bit of rounding error, yes! So, for a simple one-way ANOVA like this, the t-test presented in the second row of the summary() function output given above is equivalent to the F-test in the ANOVA table for asking whether there are significant differences between the two treatment-level means. We'll postpone looking at the t-test in greater detail until Chapter 3. The residual standard error is the same as the residual SD given in the display() function output. The multiple R-squared is the proportion of the variation explained by our model as quantified by the SS; in this case pollination type explains 17.5% of the total variation in height (relative to the SST). The adjusted R-squared is a related measure that takes the complexity of the model into account; we'll meet it again later, together with information criteria. The last line of the summary() function output gives an F-test for the whole model. In this case the model contains only the effect of pollination type, so it simply repeats the result we have already seen.
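The F = t^2 equivalence holds for any two-level factor, and we can check it directly in R. The sketch below uses simulated data (illustrative values, not Darwin's):

```r
# F = t^2 for a two-level factor: fit one model, read off both tests.
set.seed(2)  # arbitrary seed, for reproducibility
y <- rnorm(30, mean = rep(c(20, 18), each = 15), sd = 3)  # simulated heights
g <- factor(rep(c("cross", "self"), each = 15))
m <- lm(y ~ g)

t.val <- summary(m)$coefficients[2, "t value"]  # t for the treatment effect
F.val <- anova(m)[1, "F value"]                 # F from the ANOVA table
all.equal(F.val, t.val^2)                       # TRUE
```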
If you look back to the start of this chapter you will see that the data set has additional information: in particular, equal numbers of crossed and selfed plants were planted in each pot. Essentially the plants are grouped into pairs, one from each treatment. In Chapter 3 we will analyse these data using the paired t-test, as Fisher did, but we can also add the pairing to the ANOVA as follows:
 > anova(lm(height ~ pair + type, data = darwin))

             Df  Sum Sq Mean Sq F value Pr(>F)
 pair        14  86.264   6.162  0.5536 0.8597
 type         1  51.352  51.352  4.6139 0.0497
 Residuals   14 155.820  11.130

You can see that adding the pairing to the analysis takes up a lot of DF and changes the calculation of the mean squares. We are not interested in the (p.31) pairing in itself, but notice that the pairs are less variable than we would expect based on the residual error (the mean square for pair is only around half of the residual error mean square). Pair therefore has an F-value smaller than the expected value of 1 (indicating a 'negative variance component'). Situations like this can arise through sampling error but may also alert us to problems with the experimental design (see Box 2.9). As a consequence, in this case adding the pairing unexpectedly lowers the significance of pollination (p.32) type compared with the unpaired analysis, although it remains just below the arbitrary 5% mark (if it rose further, to between 5 and 10%, we might call it marginally significant). Notice that type is now tested with only 14 DF for the residual error, which reduces the statistical power of the test; you can demonstrate this for yourself using the pf() function introduced earlier by keeping the F-value constant and varying the error DF. In this case, because of the negative variance component, the cost in DF of adding the pairing to the model outweighs the benefit we would usually expect in general from a reduced residual error.
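As a sketch of that exercise, hold the F-value for type (4.6139 in the paired analysis) constant and vary the error DF with pf():

```r
# Holding the F-value constant while varying the residual DF shows how
# using up DF weakens the test (4.6139 is the F-value for type in the
# paired analysis).
p28 <- pf(4.6139, 1, 28, lower.tail = FALSE)  # with 28 error DF: smaller p
p14 <- pf(4.6139, 1, 14, lower.tail = FALSE)  # with only 14 error DF
p28
p14  # approximately 0.0497, as in the ANOVA table for the paired model
```

The same signal-to-noise ratio is judged less significant when fewer residual DF are available to estimate the noise.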
2.8 Summary
Let us recap what we've done in this chapter. Our question, following Darwin, was whether inbreeding reduces fitness. In the experiment the height of the maize plants was used as a surrogate for their fitness, and we asked whether it is reduced by self-pollination. The least squares process estimates the lines of best fit (here the means for each treatment) and quantifies the variation in the data using SS. The SST can be decomposed into the variation between treatments (the signal) and the residual variation within treatments (the noise). We get average amounts of variation by calculating the mean squares, or variances, dividing the SS by the DF. The variance ratio (or F-value) quantifies the ratio of the signal to the noise, where the null expectation is a value of 1. As values get larger than 1, the probability of the result arising by chance (a false positive) gets smaller and we become more confident that we are seeing a real effect. In our case the F-value was nearly six, and the probability of this being a false positive with the data set and statistical model used is less than 5%, the conventional point at which to declare a result statistically significant. Notice that this arbitrary level of statistical significance does not necessarily mean that the effect is biologically significant. Our enhanced presentation gave the mean heights of the two treatments and the difference between them, but the ANOVA approach puts the emphasis on the P-value and on passing arbitrary levels of significance. In Chapter 3 we will explore the t-test further and see how (p.33) the standard errors it is based on can be used to give an approach that focuses less on the P-value and level of significance and more on the estimated heights and differences together with their uncertainties.
In this example the analysis seems to support the hypothesis that inbreeding is detrimental to fitness (although the result is close to the conventional boundary of significance and both treatments have some outlying values). (p.34)