A Biologic Approach to Environmental Assessment and Epidemiology

Thomas J. Smith and David Kriebel

Print publication date: 2010

Print ISBN-13: 9780195141566

Published to Oxford Scholarship Online: January 2011

DOI: 10.1093/acprof:oso/9780195141566.001.0001

Uncertainty in Measuring Risk

Chapter: 8 Uncertainty in Measuring Risk
Source: A Biologic Approach to Environmental Assessment and Epidemiology
Author(s): Thomas J. Smith, David Kriebel
Publisher: Oxford University Press
DOI: 10.1093/acprof:oso/9780195141566.003.0008

Abstract and Keywords

This chapter reviews several important sources of uncertainty in epidemiologic studies and summarizes approaches to managing these uncertainties. It identifies five different sources of uncertainty in measures of association calculated from epidemiologic studies: errors in exposure measurement, confounding, biases, model misspecification, and systematic and random errors in outcome.

Keywords:   environmental epidemiology, epidemiologic studies, disease process, sources of uncertainty

The only way we have of studying the unknown is by pretending that it is like the known. That the unknown is like the known makes science possible; that it is also unlike the known makes science necessary. This conflict is the reason that all theories are eventually proven to be wrong, limited, irrelevant, or inadequate.

R. Levins (1995)

Chapter 7 described the principal methods of studying disease in populations, but for the most part, it did not deal with the many sources of uncertainty and error in the estimation of risk in populations. Because so much of what we know about environmental health risks comes from observational studies and not experiments, problems of chance, bias, and confounding can seriously compromise a study’s findings. A thorough investigation of potential errors is an essential component of every study. We use the term uncertainty as the overarching concept that covers bias, confounding, chance variation, and all of the other sources of error or challenges to the validity of a study. There is a growing literature on uncertainty in epidemiology and related fields, including risk assessment (Morgan and Henrion 1992; Levins 1995; Cairns and Smith 1996; Bailar and Bailer 1999; Stayner, Bailer et al. 1999). When drawing a conclusion about a hypothesized association between some exposure and a disease, a researcher should be able to describe how certain he or she is about the findings. Rarely is this certainty expressed in a simple, quantitative way, although Bayesian statistical methods are moving in this direction (Carlin and Louis 1996; Greenland 2001).

In this chapter, we review several important sources of uncertainty in epidemiologic studies and summarize approaches to managing these uncertainties. Sometimes we can directly estimate the magnitude of errors that may exist in our quantitative data. At other times, less formal methods are available for qualitatively evaluating the range of possible errors or assessing the sensitivity of our results to the assumptions that have been made during data gathering and statistical modeling. At a minimum, it is important to describe potential errors and our best judgment about their magnitudes, so that the reader of an epidemiologic study will understand the potential limitations in the findings.

There are at least five different sources of uncertainty in measures of association calculated from epidemiologic studies (Table 8.1). First, there are errors in exposure estimation. Both systematic and chance variations can lead to differences between the true exposure and what is measured. Reducing these errors is one of the main goals of the methods presented in this book. Second, the problem of confounding introduces uncertainty because there is always the possibility that either a known confounder has not been adequately controlled or, even more likely, that there are additional confounders whose existence is not known. It is only randomization that enables the researcher to control confounding by unmeasured confounders. In observational studies in which randomization is not possible, therefore, one must always remember that any result may be confounded by unknown factors.

The third source of uncertainty is the possibility of various types of bias, and these are briefly reviewed. A fourth source of uncertainty, often not discussed in epidemiologic texts, is the problem of misspecification of the form of the epidemiologic model and of the models used to estimate doses. Finally, there are both systematic and random errors in the outcome variable. Random or chance variation in the number of cases of disease (also called sampling variability) is the most widely discussed of all of these sources of uncertainty in epidemiology. It can be quantified through the calculation of confidence intervals or p-values.

Table 8.1 Sources of uncertainty in environmental epidemiology studies

| Source | Can Uncertainty Be Quantified? | Principal Methods to Evaluate/Control |
|---|---|---|
| Errors in exposure measurement | Partially | Validation studies; calibration dosimetry; sensitivity analyses |
| Confounding | Yes, for known confounders; no, for unknown confounders | Known confounders: stratification, regression; unknown confounders: randomization |
| Biases | No | Design study to avoid potential biases |
| Model misspecification | Partially | Sensitivity analysis |
| Systematic and random errors in outcome | Yes, for random errors (sampling variability); partially, for systematic errors | Random errors: confidence intervals; systematic errors: validation studies, sensitivity analyses |

8.1. ERRORS IN EXPOSURE ESTIMATION

The mismeasurement of exposure, or of dose metrics estimated from exposure data, is a very serious problem in epidemiology and one that is, to a great degree, the motivation for this entire book. Mismeasurement of exposure is often termed misclassification of exposure. This term alludes to the classification or categorization of an exposure, but it is also used when exposure is measured on a continuous scale. There are at least two distinct opportunities for misclassification of exposure in an epidemiologic study. First, the actual measurements of exposure intensity or duration may be incorrect, for one of many reasons. Second, even when the actual exposure measurements are of the highest quality, one may still misclassify exposure as it is used in an epidemiologic study if one chooses an incorrect summary measure of exposure, or dose metric, for the epidemiologic model. Because this second misclassification opportunity is not well appreciated, a brief example may help.

There is a growing body of evidence suggesting that exposure to electromagnetic fields (EMF) may increase the risk of several diseases, including brain cancer and leukemia (Feychting, Ahlbom et al. 1998; Savitz 2001; Savitz 2002; Savitz 2003). A large number of studies have been conducted, some with very sophisticated exposure assessments, based on extensive personal monitoring data for the subjects in large case-control, as well as cohort, studies. This level of detail has been possible because the equipment to measure EMF exposure intensities on different wavelengths and time scales is relatively inexpensive and portable and can gather large quantities of data very rapidly. But the existence of large quantities of EMF data does not eliminate the possibility of exposure misclassification, because it is not clear how the electromagnetic field data should be summarized (Wenzl, Kriebel et al. 1995; Kromhout, Loomis et al. 1997). The problem is that, unless one has a detailed hypothesis of the disease mechanism, it is difficult to choose the summary measure of exposure (Savitz 2003). In the case of electromagnetic fields, the problem is even more complex than for a chemical exposure because there are additional characteristics of fields that may well affect their toxicity. These include different measures of the frequency of the radiation and not just the amplitude or intensity of the field. These choices are difficult to make in the absence of a clear disease hypothesis. It is also difficult to choose appropriate time windows.
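To make the summary-measure problem concrete, the following minimal sketch computes three candidate dose metrics from the same monitoring record. The readings, the threshold, and the function name are hypothetical and chosen only for illustration; they are not taken from any of the studies cited above.

```python
from statistics import mean


def summarize_exposure(readings, threshold):
    """Return three candidate summary metrics for one subject's readings.

    readings: equally spaced field-intensity measurements (arbitrary units).
    threshold: hypothetical cutoff for the "fraction of time above" metric.
    """
    return {
        "time_weighted_average": mean(readings),
        "peak": max(readings),
        "fraction_above_threshold": sum(r > threshold for r in readings) / len(readings),
    }


# Hypothetical records: subject A has steady moderate exposure,
# subject B has low background exposure with one brief high peak.
subject_a = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
subject_b = [0.5, 0.5, 0.5, 6.0, 0.5, 0.5, 0.5, 0.5]

metrics_a = summarize_exposure(subject_a, threshold=3.0)
metrics_b = summarize_exposure(subject_b, threshold=3.0)

for name in metrics_a:
    print(name, metrics_a[name], metrics_b[name])
```

In this made-up example subject A ranks higher on the time-weighted average, while subject B ranks higher on peak intensity and on time above the threshold, so the choice of metric alone determines who is classified as more highly exposed.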

Differential Exposure Misclassification

There are two fundamental types of exposure misclassification. The first type is differential or systematic exposure misclassification, and the second is called nondifferential or random exposure misclassification (Pearce, Checkoway et al. 2006). Differential exposure misclassification occurs when exposure data are measured or otherwise misclassified in a way that is different for those who are diseased and those who are not diseased. When this occurs, it can introduce serious bias into effect estimators and, furthermore, the direction of that bias will often be difficult to predict. In a case-control study, differential exposure misclassification can be difficult to avoid if interviews with subjects are the source of exposure information. Patients with serious diseases have often been shown to remember exposures differently than healthy volunteers serving as control subjects. Similarly, interviewers may have difficulty remaining unbiased in their data gathering techniques if cases are seriously ill whereas controls are healthy. This problem is often termed recall bias, or information bias (see later in the chapter), but it can also be thought of as differential exposure misclassification. The best approach to dealing with this source of error is to avoid it through appropriate study design, for example by gathering exposure data through means that do not involve patient interviews.

It is often relatively easy to avoid differential exposure misclassification in cohort studies. The key is to collect all measurements of exposure before disease occurs or with methods that can be conducted blind to disease status. As long as the exposure estimation process is conducted independently of the assessment of disease, then differential exposure misclassification is unlikely to occur. A good example of the separation of the two assessment processes is the way that historical exposure reconstruction is often performed in occupational retrospective cohort studies. One team of investigators, often industrial hygienists, estimates the exposures in each job in a workplace and how these exposures have changed over time (Chapter 3). Any errors in these job exposure estimates will most likely be nondifferential. The job exposure estimates are then assigned to diseased and nondiseased members of the cohort through each subject’s job history. As long as these job histories are unbiased, then no differential exposure misclassification will occur. Job histories generally are taken from work records, which are unlikely to be somehow altered by disease status.

Differential exposure misclassification is a serious problem, and one that epidemiologists seek to avoid through appropriate study design. One reason that so much emphasis is placed on avoiding this type of bias in design is that its ultimate effects on measures of association are very hard to evaluate once the data are collected and the study completed. When reviewing a completed study in which differential exposure misclassification appears likely, one worries about the most fundamental aspect of this potential bias—its direction. That is, we often cannot tell whether differential misclassification is more likely to have biased a measure of association towards or away from the null—falsely inflating or underestimating the exposure-disease association.

Nondifferential Exposure Misclassification

The second type of exposure misclassification is termed nondifferential. This means that errors in exposure measurement are independent of disease status. The term random is sometimes used to describe nondifferential misclassification, but this is not quite accurate, as what distinguishes this type is that there are similar inaccuracies in exposure estimation for both the diseased and the nondiseased members of the cohort. The misclassification may or may not be random with respect to other subject characteristics. Nondifferential exposure misclassification is very common and perhaps universal. There are always errors or inaccuracies in how exposures are measured or categorized. What is important to understand, however, is how serious these errors might be in a particular study and to what degree they may have introduced error into the measures of association.

Nondifferential exposure misclassification will, in general, result in bias of the measure of association toward the null (Dosemeci, Wacholder et al. 1990; Brenner and Loomis 1994; Checkoway, Pearce et al. 2004). That is, if the true association is positive (RR greater than 1.0, for example), then the observed association will lie closer to 1.0 as a result of nondifferential exposure misclassification, and so too will a true RR below 1.0 be moved closer to the null by nondifferential exposure misclassification. This is an important principle with implications for the design and conduct of exposure assessment studies. We present two different illustrations of the phenomenon to aid in understanding of how it occurs.

Nondifferential Exposure Misclassification with Categorical Data

Imagine a closed cohort study of 200 individuals, 100 of whom are exposed and 100 of whom are not. The 200 are followed for a fixed period of time, and at the end of that time period, it is determined that 80 of the exposed and 40 of the unexposed have developed the disease (Figure 8.1, part A). One can calculate easily that in these simple data, the relative risk is 2.0. Now, imagine that when we conduct the epidemiologic study of these true data, there is an error in classification of the exposed and the nonexposed. Let us suppose that this error consists of an 80% probability of correctly classifying a person as exposed or unexposed and a 20% probability of incorrect classification. It is not difficult, then, to work out the impact of this error in exposure classification on the resulting relative risk (Figure 8.1, part B). An 80%-correct classification means that, for example, in the A cell of the 2-by-2 table, there will be only 64 exposed diseased individuals instead of the 80 that, in truth, belong to that cell. Similarly, for each of the other three cells, 80% of the original data belong in the table for the correctly classified subjects. To finish the simulation, we must determine the positions in the 2-by-2 table of the 20% of the subjects whose exposure was incorrectly classified. Starting again with the A cell, there are 16 individuals who were, in truth, diseased and exposed but were incorrectly classified as unexposed. These 16 are shifted from the upper left, or A, cell to the lower left, or C, cell. Likewise, 8 of the 40 diseased individuals who were truly unexposed are shifted from the C cell to the A cell, and the misclassified subjects in the B and D cells are exchanged in the same way. When the resulting data are summarized, the misclassified relative risk is 1.5 (Figure 8.1, part C). As predicted, this relative risk is closer to the null, meaning an underestimation of the magnitude of the association. This pattern will hold for any amount of misclassification of a dichotomous exposure variable.

Figure 8.1 Hypothetical data illustrating nondifferential exposure misclassification. Part A: These data represent the “true” exposure and disease status of a closed cohort of 200 persons followed for a fixed period. Part B: Exposure status has been misclassified. For 80% of subjects, exposure classification is correct, whereas for 20%, it is incorrect. Part C: The table of observed data, after exposure misclassification.
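The arithmetic of the categorical example can be reproduced with a short sketch. It computes the expected cell counts when each subject's exposure is classified correctly with the same probability regardless of disease status. The function and variable names are illustrative assumptions, and the counts are the hypothetical ones from the text.

```python
def misclassify_2x2(exp_cases, exp_noncases, unexp_cases, unexp_noncases, p_correct):
    """Expected 2-by-2 counts after nondifferential exposure misclassification.

    Each subject keeps the true exposure label with probability p_correct and
    receives the opposite label otherwise, independently of disease status.
    """
    p_wrong = 1.0 - p_correct
    obs_exp_cases = p_correct * exp_cases + p_wrong * unexp_cases
    obs_unexp_cases = p_correct * unexp_cases + p_wrong * exp_cases
    obs_exp_noncases = p_correct * exp_noncases + p_wrong * unexp_noncases
    obs_unexp_noncases = p_correct * unexp_noncases + p_wrong * exp_noncases
    return obs_exp_cases, obs_exp_noncases, obs_unexp_cases, obs_unexp_noncases


def relative_risk(exp_cases, exp_noncases, unexp_cases, unexp_noncases):
    """Risk in the exposed divided by risk in the unexposed."""
    risk_exposed = exp_cases / (exp_cases + exp_noncases)
    risk_unexposed = unexp_cases / (unexp_cases + unexp_noncases)
    return risk_exposed / risk_unexposed


# True data: 80 of 100 exposed and 40 of 100 unexposed develop disease.
true_table = (80, 20, 40, 60)
observed_table = misclassify_2x2(*true_table, p_correct=0.8)

true_rr = relative_risk(*true_table)          # 2.0
observed_rr = relative_risk(*observed_table)  # approximately 1.5
print(true_rr, observed_rr)
```

Changing `p_correct` shows how the observed relative risk moves toward 1.0 as classification accuracy falls toward 50%.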

Several cautions are in order, however. First, one must always remember that this behavior will occur on average in data subjected to nondifferential misclassification (Jurek, Greenland et al. 2005). In a small to medium-sized dataset such as the one in Figure 8.1, it is always possible that, by chance, the RR with misclassified exposure data will turn out to lie further from the null than the true value. To evaluate how likely this is in any given dataset, one should look at the width of the 95% confidence interval around the RR. This is discussed further later. A second caution that must be mentioned is that if an exposure variable consists of more than two categories (low, medium, high, for example), then nondifferential exposure misclassification can bias measures of association in more complicated ways, with some category risk estimates overestimated and others underestimated. The net effect can, under certain circumstances, bias away from the null (Dosemeci, Wacholder et al. 1990). Furthermore, it is not uncommon for researchers to initially measure exposure on one scale and then later collapse this scale for epidemiologic analyses. If there was nondifferential misclassification on the initial scale, the collapsed scale may end up differentially misclassified if there are more than two categories, leading to bias either toward or away from the null.
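As a rough illustration of the first caution, the sketch below computes a 95% confidence interval for a relative risk. It uses the common large-sample interval on the log scale; the text does not say which interval method the authors have in mind, so that choice is an assumption. The counts (72 of 100 classified as exposed and 48 of 100 classified as unexposed with disease) are the ones implied by 80%-correct classification of the hypothetical cohort.

```python
import math


def rr_confidence_interval(exp_cases, exp_total, unexp_cases, unexp_total, z=1.96):
    """Relative risk with a large-sample (log-scale) confidence interval."""
    rr = (exp_cases / exp_total) / (unexp_cases / unexp_total)
    # Standard error of ln(RR) for cumulative-incidence data.
    se_log_rr = math.sqrt(
        1 / exp_cases - 1 / exp_total + 1 / unexp_cases - 1 / unexp_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


rr, lower, upper = rr_confidence_interval(72, 100, 48, 100)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

For these counts the interval runs from roughly 1.2 to 1.9, which gives a sense of how far chance alone could move an estimate in a cohort of this size.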

Nondifferential Exposure Misclassification with Continuous Data

Now let us consider a slightly more complicated scenario. Imagine that exposure data are continuous rather than dichotomous. For this example, we assume that the outcome data are represented by a continuous measure as well. Let us assume that the outcome is some measure of physiologic function. When both exposure and outcome data are continuous, the measure of association that is typically used is the slope of the regression line between exposure and outcome. What will be the impact of nondifferential exposure misclassification on the slope of a regression model, fit to continuous exposure-response data? To illustrate this, we have simulated a dataset of 100 observations, exposure and outcome, each on an arbitrary scale (Figure 8.2). In the simulation, there is a strong positive association between exposure and outcome (the slope was 0.5, in the arbitrary units of these simulated data). One can see that as exposure increases, response increases also, with only a small amount of variability around the regression line. To simulate the impact of misclassification, random errors were added (or subtracted) to each exposure measurement in Figure 8.2. Each exposure data point was given a certain amount of error, which was randomly chosen to be up to 100% of its true value. That is:

measured exposure = true exposure ± (error × true exposure), where:

error is randomly drawn from a uniform distribution with limits (0 to 1).

When the measured exposure is used to regress response in these data, the result is a reduction in the slope or measure of association from 0.5 to 0.24 (Figure 8.3). One sees again the same pattern as in the first example: bias toward the null, or a weakening of the observed strength of association between exposure and response. In summary, exposure misclassification will, on average, lead to bias toward the null in both 2-by-2 tables and ordinary linear regression models, as long as the misclassification is nondifferential with respect to disease status.
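A simulation along these lines can be sketched in a few lines of code. The error model follows the formula given above; the exposure range, the scatter around the regression line, and the random seed are assumptions, because the text does not report them, so the attenuated slope will not match 0.24 exactly.

```python
import random


def ols_slope(x, y):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(8)  # fixed seed so the sketch is reproducible

# Assumed design: 100 subjects, true exposure uniform on 0 to 10,
# true slope 0.5, small Gaussian scatter around the line.
true_exposure = [rng.uniform(0.0, 10.0) for _ in range(100)]
response = [0.5 * x + rng.gauss(0.0, 0.3) for x in true_exposure]

# measured exposure = true exposure +/- (error * true exposure),
# with error drawn from a uniform distribution on (0, 1).
measured_exposure = [
    x + rng.choice((-1.0, 1.0)) * rng.uniform(0.0, 1.0) * x
    for x in true_exposure
]

slope_true = ols_slope(true_exposure, response)
slope_measured = ols_slope(measured_exposure, response)
print(f"slope with true exposure:     {slope_true:.2f}")
print(f"slope with measured exposure: {slope_measured:.2f}")
```

Because the error is unrelated to the response, it adds variance to the exposure variable without adding covariance with the outcome, which is why the fitted slope shrinks.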


Figure 8.2 Simulated data for illustrating nondifferential exposure misclassification in continuous exposure data. The graph represents the exposure-response relation among 100 subjects, each with a “true” measure of exposure, and a measure of physiologic function, Y.

One way to understand the underestimation of an exposure effect that often comes from nondifferential exposure misclassification is to think of it as diluting the exposed group. The error mixes up some subjects who were truly exposed with some who were truly not. If exposure is a risk factor for disease, then the truly exposed will be more likely to be diseased. Thus mistakenly calling some of these subjects “unexposed” moves some diseased subjects into the unexposed category, raising the overall risk in this group and “averaging out” the risks in the truly exposed and truly unexposed groups. The problem will be particularly serious when small risks are being estimated. Most exposure definitions contain some error, and if these errors are substantial, they may lead to so much bias toward the null that weak exposure-response associations cannot be seen at all.

Figure 8.3 Simulated data, as in Figure 8.2, except that the exposure data have been misclassified by adding random errors. The observed slope, or measure of association, is biased toward the null.

Evaluating and Correcting Errors in Exposure Estimation

All standard regression models assume that the independent variables are measured without error. One can see a manifestation of this on many xy graphs representing regression models (see, e.g., Figure 7.3). Such graphs often show vertical error bars, representing variability in the dependent variable, but rarely present horizontal error bars, which would indicate uncertainty in the x variable. It should seem obvious from what has been presented in the first half of this book that the implied assumption that exposure measurements are free of error is often violated in epidemiology (Jurek, Maldonado et al. 2006). In fact, it is so universally violated that one may wonder why this assumption is built in to regression—our most widely used statistical model. The answer is that the regression approach was developed in the context of experimental research, in which exposure or “treatment” was fixed by the investigator, who then observed the response and its variability due to sampling or chance.

When exposure data that contain errors, for example nondifferential exposure misclassification, are used in regression models, then the standard statistical measures of uncertainty, such as the confidence interval around the slope, do not include uncertainty that comes from these errors (Spiegelman, McDermott et al. 1997; Jurek, Maldonado et al. 2006). At a minimum, one should keep this in mind in interpreting confidence intervals and p-values in regression models. Better methods would include validation studies of the exposure assessment methods, formal adjustment for exposure measurement errors, or sensitivity analyses to understand the impact of plausible degrees of uncertainty in exposure on the study results.

Sometimes validation substudies are conducted through the comparison of two exposure methods: one expensive and highly accurate, the other more economical but less accurate. Chen and colleagues (Chen, Chang et al. 2004) conducted a study of low back injuries in taxi drivers in Taiwan. One exposure of concern was whole-body vibration (WBV). It was not feasible to conduct direct measurements of WBV in more than 1,000 drivers who participated in the study, and so a validation study was conducted for about 250 drivers, using direct WBV measurements in the taxicab. The aim of the validation study was to identify determinants of WBV exposure (the authors called it an “exposure prediction rule”) so that the WBV exposures of all study participants could be estimated, based on available information such as the age, make and model of car, size of engine, and so on.

When data from validation studies such as this one of WBV in taxicabs are available, then statistical methods can be used to adjust or correct the exposure-risk associations that are made using the economical but inaccurate exposure measure. A variety of different techniques are available for combining information from the validation and main studies (Armstrong 1995; Holford and Stack 1995; Armstrong 1998; Chatterjee and Wacholder 2002). Perhaps the simplest to understand is called regression calibration (Spiegelman and Valanis 1998). The method can be summarized as follows. Suppose we use a logistic regression model to quantify the association between a dichotomous outcome Y and an exposure Xapprox that is measured with error, although we will assume that it is not differentially misclassified (see previous discussion). The usual logistic model can be used:

Equation 8.1 Logistic model for a dichotomous outcome

$$\operatorname{logit}[p(Y=1)] = \beta_0 + \beta_1 X_{\text{approx}}$$

where β0 is a nuisance parameter, and exp(β1) estimates the odds ratio for a one-unit change in Xapprox (Kleinbaum and Klein 2002). Now suppose that in a validation substudy, we have measured not only Xapprox, using the same methods to be used in the full study, but we have also measured Xtrue, or something as close to a “true” measure of X as is possible. This true value is often called a “gold standard.” Often, a truly gold standard cannot be obtained, but rather something we believe to be better, though not perfect. This is often termed an “alloyed gold standard,” and in this case more complicated methods are required (Spiegelman, Schneeweiss et al. 1997). If Xapprox and Xtrue are not highly correlated, then the odds ratio calculated from Equation 8.1 will be biased toward the null. This can be corrected in two steps: first, by regressing Xtrue on Xapprox in the validation data:

Equation 8.2 Calibration regression of Xtrue on Xapprox

$$X_{\text{true}} = \alpha + \gamma X_{\text{approx}}$$

where α and γ are parameters estimated from the data. In the second step, the regression coefficient from Equation 8.1 is corrected for the measurement error quantified in Equation 8.2:

Equation 8.3 Correction of the regression slope for measurement error

$$\beta_{\text{corr}} = \frac{\beta_1}{\gamma}$$

where βcorr is an estimate of the strength of the association between Y and Xtrue. When there are covariates in the exposure-response model (Equation 8.1), the formulas are more complicated, but the approach is conceptually the same (Rosner, Spiegelman et al. 1990).
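The two calibration steps can be sketched as follows for the simple case with no covariates. The sketch estimates the calibration slope γ by ordinary least squares from paired validation measurements and then divides an uncorrected logistic slope by it. The validation pairs and the value of β1 are hypothetical, and the logistic model itself is assumed to have been fit elsewhere.

```python
import math


def calibration_fit(x_approx, x_true):
    """Least-squares fit of X_true = alpha + gamma * X_approx (validation data)."""
    n = len(x_approx)
    mean_approx = sum(x_approx) / n
    mean_true = sum(x_true) / n
    sxy = sum((a - mean_approx) * (t - mean_true) for a, t in zip(x_approx, x_true))
    sxx = sum((a - mean_approx) ** 2 for a in x_approx)
    gamma = sxy / sxx
    alpha = mean_true - gamma * mean_approx
    return alpha, gamma


def corrected_slope(beta_1, gamma):
    """Regression-calibration correction: divide the naive slope by gamma."""
    return beta_1 / gamma


# Hypothetical validation substudy: paired approximate and "true" exposures.
val_x_approx = [1.0, 2.0, 3.0, 4.0, 5.0]
val_x_true = [1.5, 2.0, 2.5, 3.0, 3.5]

alpha, gamma = calibration_fit(val_x_approx, val_x_true)

beta_1 = 0.20  # hypothetical uncorrected log odds ratio per unit of X_approx
beta_corr = corrected_slope(beta_1, gamma)

print(f"gamma = {gamma:.2f}")
print(f"uncorrected OR = {math.exp(beta_1):.2f}, corrected OR = {math.exp(beta_corr):.2f}")
```

This point-estimate sketch ignores the extra uncertainty that comes from estimating γ, which a full analysis would carry into the confidence interval for the corrected odds ratio.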

Spiegelman and Valanis (1998) presented a good example of an application of these methods, applied to a study of acute health effects of exposure to antineoplastic agents among pharmacy personnel. An exposure and health symptom survey was conducted among 675 pharmacists, covering a 3-month period. Self-reported data on the average weekly number of antineoplastic agents handled and a variety of health symptom information were gathered. A validation study was conducted to assess the accuracy of the 3-month recall of exposure information. A subsample of 56 pharmacists completed 1- to 2-week diaries in which they kept track of their handling of antineoplastic drugs. These data were considered to be close to “true” because the diaries were kept onsite, and filled in continuously.

From 27 symptoms that were initially assessed by questionnaire, Spiegelman and Valanis chose one—fever—for this investigation because it was relatively prevalent (overall prevalence 17%) and plausibly associated with exposure to antineoplastic agents. The crude odds ratio for the association between prevalence of fever and the questionnaire-based number of drugs mixed per week was 1.13 (95% confidence interval: 1.03 to 1.23), comparing the 10th to the 90th percentiles of the exposure distribution. The correlation between the questionnaire exposure estimate (Xapprox) and the diary data (Xtrue) was 0.70. After adjusting the odds ratio for this error, the corrected estimate, ORcorr, was 1.22 (95% CI: 1.04 to 1.43).

The authors noted that there were several limitations to the application of regression calibration in this setting. They were concerned that because fever was fairly common, the odds ratio calculated in the logistic regression model would not be a good estimate of the prevalence ratio (Thompson, Myers et al. 1998). To address this and other limitations of regression calibration, Spiegelman and Valanis also presented a maximum likelihood method of accomplishing the same end, using a model which is probably analytically superior but more difficult to explain (Spiegelman and Valanis 1998).

8.2. CONFOUNDING

Confounding was introduced in the previous chapter, and the standard methods for controlling confounding were reviewed. Here we discuss several aspects of confounding that are of particular concern when quantitative exposure-response relations are being investigated in epidemiologic studies.

To be a confounder, a factor must be associated with the disease, which usually means that it is an independent risk factor for the disease, and the factor must be associated with the exposure. This latter requirement may need further explanation. “Association with exposure” means that there is a different prevalence of the confounder in the exposed and unexposed groups, or among groups with different exposure ranges, if the exposure variable is continuous. If the confounder is measured on a continuous scale, then this association with exposure means that the mean value of the confounding variable differs across levels of the exposure. It is also important to remember that a factor which does not vary among members of the study group cannot confound. This fact is the key to confounder control: analyzing the exposure-response association among subgroups who all share the same level of the potential confounder removes any possibility of confounding by that factor.

Confounding is not an “all or nothing” phenomenon; the amount of bias in a measure of association may be small or large depending on the correlations between the confounder and exposure and between the confounder and the response variable. The amount of bias will not be greater in magnitude than the weaker of these two associations (Rothman, Greenland et al. 2008). That is, a weak association between exposure and a confounder cannot explain a strong apparent exposure effect. This fact has not always been adequately appreciated by reviewers of occupational cancer studies that have not been able to directly evaluate potential confounding by smoking (Kriebel, Zeka et al. 2004).

Large cohort studies constructed using work records and vital statistics information have been an important source of knowledge on occupational cancer risks. But smoking data are rarely available in these studies, because they rely on existing records, and often it is not feasible to contact subjects to obtain smoking histories. If one is using a study of this type to identify a cause of one of the many smoking-related cancers, then confounding by smoking might occur if the prevalence of smoking were different among subgroups with different levels of exposure to the potential occupational carcinogen. In the absence of smoking data, how can the possibility of serious confounding by smoking be evaluated? Several authors have shown that plausible smoking differences among subgroups of a working population will rarely create relative risks for lung cancer greater than about 1.5, and it is even lower for diseases less strongly associated with smoking (Axelson and Steenland 1988; Siemiatycki, Wacholder et al. 1988; Kriebel, Zeka et al. 2004). In other words, weak exposure-response associations might be falsely created by unmeasured confounding by smoking, but even moderately strong associations are not likely to be.
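One way to see why smoking differences have limited leverage is to compute the relative risk that confounding alone would produce. The sketch below uses a standard indirect-adjustment formula for a dichotomous confounder with no interaction; whether this matches the exact calculations in the papers cited above is an assumption, and the smoking prevalences and the smoking relative risk are hypothetical round numbers.

```python
def confounding_rr(prev_exposed, prev_unexposed, rr_confounder):
    """Relative risk produced purely by a dichotomous confounder.

    prev_exposed / prev_unexposed: prevalence of the confounder (for example
    smoking) in the exposed and unexposed groups.
    rr_confounder: relative risk of disease for the confounder itself.
    Assumes the exposure has no effect and does not interact with the confounder.
    """
    excess = rr_confounder - 1.0
    return (1.0 + prev_exposed * excess) / (1.0 + prev_unexposed * excess)


# Hypothetical scenario: 50% smokers among the exposed, 35% among the
# unexposed, and a tenfold lung cancer risk in smokers.
bias = confounding_rr(0.50, 0.35, 10.0)
print(f"RR from confounding alone: {bias:.2f}")
```

Even with a fifteen-point difference in smoking prevalence and a tenfold smoking effect, the spurious relative risk in this made-up scenario is about 1.3.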

Confounding with Continuous Variables

Confounding is most often illustrated and investigated in the context of categorical exposure data by showing the change in the relative risk when the data are stratified on the confounding factor (see Chapter 7, Table 7.2). However, for the quantitative exposure-response modeling that this book emphasizes, it is important to gain an understanding of how confounding functions when exposure data are continuous (Miettinen 1985; McNamee 2005). Suppose we are studying an exposure-response relation in continuous exposure and response data, much as in the simulation in Figure 8.2 (we would like to credit Professor Olli Miettinen for this graphical understanding of confounding; (Miettinen 1985)). We will use a cartoon of such an exposure-response association in which the data are represented by an oval-shaped “cloud” which indicates the general area in which the data are concentrated (Figure 8.4). The first figure represents data in which there is a strong positive association between exposure and response, represented by the straight dotted line through the middle of the data cloud. To this simple picture, we now add a (p. 210 ) third factor—gender. Imagine that in both genders there is a similar exposure-response relation, but that for some reason females have a higher background risk than males. This is indicated in part B of Figure 8.4 by two parallel lines, one above the other but with identical slopes (the genders are said to have a common slope). The line representing the females starts at a higher baseline or background response in the absence of exposure, but the increase in response per unit change in exposure (the slope) is the same as among males. The graph in part B also indicates that the exposure distributions in males and females are the same. One can see this by

Figure 8.4 Graphic representations of an exposure-response relation in continuous data. The lines represent the exposure-response relation, and the ovals indicate where the data lie. Part A: A simple, linear exposure-response relation. Part B: The same exposure effect is observed in both sexes, but among females, the background risk is higher. The exposure data have similar distributions in males and females. There is no confounding in these data.

(p. 211 ) noting that the locations on the x-axis of the “clouds” of data for males and females are very similar. If one were to calculate the means and standard deviations of the exposures in the two genders, they would be quite similar. Is gender a confounder of the exposure-response relation in part B of Figure 8.4? No. Gender is associated with response, but not with exposure, and so it cannot be a confounder.

Confounding by gender is illustrated in Figure 8.5, in which the exposure distributions of males and females are different. Notice that the female cloud is

Figure 8.5 Graphic representations of an exposure-response relation in continuous data, showing confounding by sex. Part A: The same exposure effect in males and females as in Part B of Figure 8.4, but now the female exposure distribution is shifted to higher levels. Sex is a confounder, because sex is associated with both risk (females at higher risk at all exposure levels) and exposure (female exposure distribution shifted to higher levels compared to males). Part B: Illustrating the crude, confounded relation between exposure and response. If sex is ignored and a single, crude association is calculated, its slope will be greater than the true common slope for both males and females, when studied separately.

(p. 212 ) somewhat higher on the exposure scale than is the male cloud—the mean exposure in females would be greater than the mean male exposure. If gender is ignored and a single exposure-response trend is fit in the full dataset, it might look like the black dotted line in part B. The slope of this line is steeper than the true common slope for males and females. To avoid this confounding, one should analyze the data in a way that preserves the view in part A, in which one can see the common slope because the data are analyzed separately for males and females. In practice, this is accomplished through stratification or through a regression model in which gender is an independent variable entered into the model alongside the exposure variable.
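The cartoon in Figure 8.5 can be reproduced with a small simulation. The following is a minimal sketch under assumed values: a true common slope of 1.0, females with a background response 3 units higher than males, and a female exposure distribution shifted 2 units upward. None of these numbers comes from the figures; they are hypothetical. The adjusted slope is obtained by pooling the within-gender sums of squares and cross-products, which gives the same slope as a regression model containing exposure and a gender indicator.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(n_per_group=2000, true_slope=1.0, seed=1):
    """Return (crude_slope, adjusted_slope) for simulated males and females."""
    rng = random.Random(seed)
    # Hypothetical (mean exposure, background response) for each gender:
    # females are shifted upward on both axes, as in Figure 8.5.
    settings = {"male": (5.0, 0.0), "female": (7.0, 3.0)}
    groups = {}
    for name, (mean_x, baseline) in settings.items():
        xs = [rng.gauss(mean_x, 1.0) for _ in range(n_per_group)]
        ys = [baseline + true_slope * x + rng.gauss(0.0, 1.0) for x in xs]
        groups[name] = (xs, ys)

    # Crude analysis: ignore gender and fit one line to everyone.
    all_x = groups["male"][0] + groups["female"][0]
    all_y = groups["male"][1] + groups["female"][1]
    crude = slope(all_x, all_y)

    # Adjusted analysis: pool within-gender sums of squares and cross-products.
    sxy = sxx = 0.0
    for xs, ys in groups.values():
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    return crude, sxy / sxx


crude, adjusted = simulate()
print(f"crude slope    = {crude:.2f}")
print(f"adjusted slope = {adjusted:.2f}")
```

With these settings the crude slope should be clearly steeper than the adjusted slope, and the adjusted slope should lie close to the true value of 1.0. Setting both mean exposures to the same value removes the association between gender and exposure, and the two slopes should then agree apart from chance variation, as in part B of Figure 8.4.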

Incomplete Control of Confounding

Statistical methods such as stratification or regression modeling can control confounding only by factors for which data have been gathered. If, alternatively, there is an unknown factor that is a confounder, the bias it will cause cannot be controlled in an observational study. Only randomization has the ability to control for the confounding influence of factors that are not measured. This problem is well described in every epidemiology textbook, but in practice, epidemiologists often behave as if unmeasured confounders are unlikely to be a serious problem, when there is no way to really know this. Thus there is uncertainty that derives from what we do not know—and this is by definition impossible to quantify.

Uncertainty deriving from possible bias in study design and execution is quite similar in nature—there is rarely any way to objectively know whether a bias exists, nor how strong its effect on the results might be. It is standard practice in reporting the results of an epidemiologic study to say at the end of the article that the results are valid only “if there is no bias and no uncontrolled confounding.” But because this is an unverifiable assertion, it is often essentially ignored in the evaluation of the evidence. Unfortunately, this habit may give the nonscientist the impression that the researchers are more confident than they really should be about the study findings.

8.3. BIASES

A measure of association from an epidemiologic study will be biased if there is a consistent difference between the measured association and the true association. Bias is defined formally as the difference between what is measured and the truth, and because the true association is never known, it is essentially impossible to quantify the magnitude of a bias. Rather, biases should be avoided or minimized in the design of a study. Partial assessment of bias is possible, however, when one is able to conduct validation studies—for example, on the exposure assessment as described previously. There are two broad categories of bias in epidemiology, which are reviewed here briefly: first, selection bias, and second, information bias. (p. 213 )

Selection Bias

Selection bias is best understood by seeing how it works in the two major study designs, cohort and case-control studies. In a cohort study, selection bias occurs if disease status influences selection into exposure groups. For example, suppose that we are studying an outbreak of childhood leukemia in a town. There is concern that proximity to a toxic waste site may be responsible for an increased risk of the disease. As often happens in these cases, imagine that a certain number of children with leukemia living near the site have already been identified through neighborhood meetings and friendship networks. These are often called the index cases. To follow up on this concern, we might conduct a cohort study, identifying all the children in the town and dividing them into “exposed” and “unexposed,” according to proximity to the site. A cancer registry or other case-finding system could then be used to identify all the cases of leukemia in the town. Selection bias might occur if we automatically place the index cases in the exposed group instead of using a blind and consistent exposure assignment process for all subjects. Ideally, the assignment of children to the exposed and unexposed groups would be done by someone who was unaware of the health status of the children, so that this knowledge did not influence their classification. In practice, this is often difficult in small studies.

Selection bias operates somewhat differently in case-control studies. In these studies, selection bias occurs if exposure status influences selection into the case or control groups. An example may help to illustrate the problem. Suppose one is conducting a study of cigarette smoking and risk of lung cancer. Who should be chosen as controls? If we were to choose patients with heart disease as controls, we would be introducing a selection bias, and the resulting measure of association between smoking and lung cancer would probably be an underestimate of the true association. Heart disease is more prevalent among smokers than nonsmokers, and, as a result, the contrast of smoking habits between cases of lung cancer and cases of heart disease would not be as large as between lung cancer cases and the general population. To avoid selection bias and produce an accurate estimate of the strength of the association between smoking and lung cancer, the smoking habits of the controls should be representative of the population from which the cases came. This means that controls should either be drawn in some random fashion from the general population or, if they are selected from a hospital population, controls should have diseases that are unrelated to the exposure.
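A short calculation shows how the choice of controls changes the measure of association. The smoking proportions below are hypothetical, chosen only to illustrate the direction of the bias: 90% smokers among lung cancer cases, 40% in the general population, and 60% among heart disease patients.

```python
def odds_ratio(p_exposed_cases, p_exposed_controls):
    """Exposure odds ratio from the proportion exposed among cases and controls."""
    odds_cases = p_exposed_cases / (1.0 - p_exposed_cases)
    odds_controls = p_exposed_controls / (1.0 - p_exposed_controls)
    return odds_cases / odds_controls


# Hypothetical proportions of smokers in each group.
smokers_among_cases = 0.90           # lung cancer cases
smokers_in_population = 0.40         # population that gave rise to the cases
smokers_among_heart_patients = 0.60  # heart disease is itself related to smoking

or_population = odds_ratio(smokers_among_cases, smokers_in_population)
or_heart = odds_ratio(smokers_among_cases, smokers_among_heart_patients)
print(f"OR with population controls:    {or_population:.1f}")
print(f"OR with heart disease controls: {or_heart:.1f}")
```

Because the heart disease controls smoke more than the population that produced the cases, the odds ratio computed with them is smaller than the odds ratio computed with population controls, which is the underestimate described above.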

Information Bias

The second type of bias, information bias, is again best illustrated separately in cohort and case-control studies. In cohort studies information bias occurs if information on disease is influenced by exposure status. For example, suppose that the clinicians diagnosing a particular disease have beliefs about the exposures which (p. 214 ) lead to that disease. They may believe correctly that people with a particular exposure are more likely to get that disease. As a result, they may look more closely for the disease in an exposed population. For example, in a study of asbestos workers, a physician might search more aggressively for cases of lung cancer because of the causal link between asbestos and this cancer. This behavior would be very appropriate clinical practice, but it might create an information bias if clinical data were used to identify cases for an epidemiologic study. If the diagnostic practices are different among exposed and nonexposed subjects, then the observed exposure-disease association may be biased.

Information bias functions somewhat differently in case-control studies. Information bias occurs if the determination of exposure status is influenced by case or control status. This problem was discussed earlier under the heading of differential exposure misclassification, which is a kind of information bias. When participant interviews are used to gather exposure information in case-control studies, then there is often concern that information bias will occur because of the potential for differential recall of personal histories among those who are sick (cases) and those who are healthy (controls). Indeed, minimizing information bias is one of the principal reasons for choosing hospital controls in case-control studies. Hospital controls are, like the cases, ill, and so may be in a similar frame of mind. As long as the control subjects’ illnesses are not associated with the exposure of interest, then this approach may avoid both selection bias (see the previous discussion of selection bias from choosing hospital controls) and information bias.

Biases are best avoided, because once data have been gathered in a biased way, it is very difficult to control the bias or to assess quantitatively how large its effects might be. Sometimes one can conduct sensitivity analyses, which can provide useful information on how large the effects of a particular bias might be in a worst case scenario, for example. An example was mentioned in section 8.2, in the context of evaluating the effects of unmeasured confounders. If data on smoking are lacking from a study in which smoking might conceivably confound an observed association, it is possible to conduct sensitivity analyses to evaluate how large the effects of smoking could have been under some assumption about the smoking distribution in the study population (Kriebel, Zeka et al. 2004).

8.4. MODEL FORM MISSPECIFICATION

Even the simplest analysis of epidemiologic data uses a model. The 2-by-2 table is a statistical model, with a set of underlying assumptions that may or may not accurately correspond to the reality that is being summarized. The assumptions behind a typical epidemiologic data analysis are numerous and often difficult or impossible to directly verify (Table 8.2). At a minimum, it is important to be aware of these assumptions, but, when possible, one should also investigate the degree to which errors in these assumptions affect the study findings. In this book we discuss (p. 215 )

Table 8.2 Examples of choices made in a typical epidemiologic study and their underlying assumptions.

| Category | Typical Choice | Underlying Assumption |
|---|---|---|
| Exposure data | Mean of available data | Environmental samples representative & measured without error |
| Summary measure of exposure | Lifetime cumulative exposure (CE) | CE directly proportional to disease risk |
| Coding of exposure variable | Continuous variable | Risk rises exponentially with CE over its entire observed range |
| Confounders | Age, smoking (packyears) measured by questionnaire | Packyears is correct summary measure of tobacco; No recall bias; Risk linearly related to age; No other important confounders exist |
| Study design | Case control study | Controls representative of study base |
| Outcome | Lung cancer incidence | Diagnoses are accurate and complete; ICD coding is relevant for exposure under study |
| Latency | Ignore 20 years’ exposure prior to disease onset | No effect of exposure after cutoff; All prior exposure has same “potency” |
| Statistical model | Logistic regression | Exposure and confounders have multiplicative joint effects; Extreme exposures do not have excessive influence on slope estimate; Error distribution is appropriate |
| Evaluating chance variability | Statistical significance, p < 0.05 | No bias; No uncontrolled confounding; All of above assumptions are correct |

the construction of models to describe the toxicokinetics and pharmacodynamics that govern the biologic pathways between exposure and disease. These additional models may add additional uncertainty to epidemiologic analyses, but they can also be seen as a way to make explicit certain fundamental assumptions about how exposure leads to disease—processes that necessarily underlie even the simplest exposure-response model.

A simple example may help to clarify this point. Suppose that one wishes to investigate the effects of both an air pollutant and tobacco smoke on lung disease risk. The typical approach would be to enter variables coding for each of these covariates into a regression model. If the regression model is a linear one, then the assumption being made is that the air pollutant and tobacco exposures contribute additively to change in the dependent variable. If, on the other hand, one uses a logistic or Cox proportional hazards model, then the implicit assumption is that the two risk factors have a multiplicative relation to the dependent variable. Although these two types of joint effect are the most frequently discussed in epidemiology, (p. 216 ) there are in fact many other ways in which two factors may contribute to a dependent variable. Many other possibilities could be represented by disease process models like those described elsewhere in this book. With an explicit hypothesis about the ways that the pollutant and tobacco smoke entered the body, reached the target tissue, and caused cellular damage, one might construct a biologically based model. The nature of the joint effect of the two exposures—whether additive, multiplicative, or something else—would not be a separate assumption but would be a consequence of the particular structure of the disease process model. Although we believe that this approach holds promise, there is as yet little published research about the best ways to construct and apply these kinds of models. It is clear that even simple models can have profoundly different implications for underlying processes. The epidemiologic behavior of disease process models is discussed further in Chapter 9.
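The difference between the two standard joint-effect assumptions is easy to see numerically. In the sketch below, the relative risks of 2 for the air pollutant and 10 for tobacco smoke are hypothetical, and the multiplicative case treats the odds ratios from a logistic model as approximations to relative risks, which is reasonable only when the disease is rare.

```python
def joint_rr_additive(rr_a, rr_b):
    """Joint relative risk when the excess risks of the two factors add."""
    return 1.0 + (rr_a - 1.0) + (rr_b - 1.0)


def joint_rr_multiplicative(rr_a, rr_b):
    """Joint relative risk when the two relative risks multiply."""
    return rr_a * rr_b


# Hypothetical single-factor relative risks.
rr_pollutant = 2.0
rr_tobacco = 10.0

additive = joint_rr_additive(rr_pollutant, rr_tobacco)
multiplicative = joint_rr_multiplicative(rr_pollutant, rr_tobacco)
print(f"additive joint effect:       RR = {additive:.0f}")
print(f"multiplicative joint effect: RR = {multiplicative:.0f}")
```

For people exposed to both factors, the two assumptions predict relative risks of 11 and 20, so the choice of regression model is itself a substantive assumption about how the exposures act together.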

Returning now to the world of standard epidemiologic practice, it must be said that here, too, there is little research on the impacts of the choice of a particular statistical model on study results. The choice of, say, a logistic model or a 2-by-2 table is often made for convenience, the availability of software, and other reasons that have little to do with the nature of the underlying exposure-disease process. Using a model that does not adequately capture the behavior of the underlying physiologic processes can introduce an additional source of error into the measure of association or other study results. This topic is very little investigated in epidemiology, but we recommend that investigators begin to study the implications of model selection by varying the model structures that they use and investigating the impacts of these changes on study results (Robins and Greenland 1986; Maldonado and Greenland 1993; Stayner, Smith et al. 1995; Greenland 1996; Maldonado and Greenland 1996).

We recommend that all epidemiologic studies include an investigation of the sensitivity of the results to the choice of model form (Greenland 1996; Greenland 2001). The biologically based models that we advocate in this text are experimental, and there is as yet very little experience with them. It is therefore important that the results from a particular model form be compared with those that come from other approaches to modeling the same data. In the absence of knowledge of the true exposure risk association, the best one can do in this investigation is to report the sensitivity of the final findings to the choices that have been made in the forms of the models.

Investigators from the National Institute for Occupational Safety and Health (NIOSH) have published studies comparing the results of a wide variety of different models fit to epidemiologic data on cohorts exposed to cadmium (Stayner, Smith et al. 1995) and to silica (Rice, Park et al. 2001; Park, Rice et al. 2002). These results demonstrate that the essential or qualitative findings about the risk from cadmium and silica exposures were consistent across a range of different types of epidemiologic models. However, the selection of a particular model could, in some cases, have a large influence on the resulting estimates of risk. This variability (p. 217 ) from model to model serves as a warning not to rely overly much on any one model form in the absence of strong hypotheses (for instance, based on mechanistic studies) about which one is most likely to be correct. An excellent discussion of the issues involved in choosing among exposure-response models has been published by Steenland and Deddens (Steenland and Deddens 2004).

8.5. ERRORS IN OUTCOME VARIABLES

Both random and systematic errors in the outcome variable must be considered. Random or sampling variability in the number of outcome events is perhaps the most familiar source of uncertainty in epidemiologic studies, and it is investigated with the use of p-values and confidence intervals. Despite the emphasis placed on measuring this source of uncertainty and achieving “statistical significance,” sampling variability in outcome probably accounts for a minority of the uncertainty in most epidemiologic studies. Systematic errors in outcome variables can be investigated in many of the same ways as systematic errors in exposure variables (Pearce, Checkoway et al. 2006). Systematic errors in outcome variables, although just as important as errors in exposure variables, are not discussed further because our focus is on improving exposure estimation.

Confidence intervals or p-values are found in every epidemiologic study, and so it is important to understand what they do and do not measure. In a typical epidemiologic investigation of an exposure-response relationship, one studies a population in which there is some variation in the level of exposure among population members. The quantitative estimate of the change in risk across this range of exposures is typically evaluated using a regression model. The slope of the regression equation estimated from the observed population data is a measure of the strength of the association between exposure and response. We noted in section 8.1 that all standard regression methods assume that there is no error in the exposure variable, because these methods were developed for the analysis of data from experiments in which the independent variables were fixed by the investigator. It is very important, therefore, to always remember that the p-values and confidence intervals that one calculates for epidemiologic results do not take into consideration uncertainty in the exposure and other independent variables.

Evaluating Variability in Outcome

The notion of statistical significance and the associated concepts of the p-value and confidence interval were developed in the statistics of experiments (Neyman and Pearson 1928). In the experimental setting, it is often possible to randomize observations across treatment, or exposure groups. This approach allows some confidence that confounding has been avoided, including confounding by factors that (p. 218 ) are completely unknown to the researcher. For this reason, experiments can be designed to formally test hypotheses, and one can reject or fail to reject these hypotheses by quantifying the role that chance variability may have played in the experimental results.

In the absence of randomization, it is much more difficult to view the measurement of exposure-response associations as leading to formal hypothesis testing. Nevertheless, because this view is so widely taught in epidemiology and statistics courses, it provides an important framework for the interpretation of study results, and a brief review may be helpful.

Hypothesis Testing

The essence of the testing framework is the contrast between two competing hypotheses, called the null hypothesis (H0) and the alternative hypothesis (HA). In any given investigation, it is generally a fairly simple matter to identify the null hypothesis, whereas there can be many alternative hypotheses. When an exposure-disease relation is being studied, then the null hypothesis will usually specify that there is no association between exposure and disease, or equivalently that the exposed and unexposed do not differ in disease incidence (or prevalence, or some other measure of outcome). Several common alternative hypotheses are: that the exposed show a greater incidence of disease than the unexposed, that the exposed show a lower incidence than the unexposed, or the more cautious hypothesis that the exposed show a different incidence than the unexposed.

The role of data in this framework is to allow a test of the null hypothesis. If H0 is rejected, then this is taken as evidence in support of HA. Alternatively, if H0 is not rejected, then it is supported by the data. Technically, one can never “prove” either hypothesis, and one cannot directly test the consistency of the data with HA, only their inconsistency with H0. One tests the null hypothesis and evaluates “statistical significance” using a p-value. This statistic assesses the likelihood that the data are consistent with H0 or, more informally, provides an answer to the question: How likely is it that the observed results (see note 1) could have occurred, by chance alone, if H0 is correct?

Suppose that we have conducted an epidemiologic study in which we have observed a relative risk of 2.0, comparing some disease risk among exposed and unexposed groups. We can presume that the null hypothesis was that there was no association between exposure and disease. Let us suppose that our alternative hypothesis was the most cautious one, stating simply that exposure affects disease risk, without specifying whether it increases or decreases risk. We could summarize the competing hypotheses like this:

H0: RR = 1.0
HA: RR ≠ 1.0 (p. 219 )

Suppose that we observed a p-value for this result of 0.02. Because this is less than the traditional 5% threshold, the result can be called statistically significant. But what does that mean? The p-value has the following meaning: there is a 2% chance that the result obtained (or a more extreme result) would have occurred by chance, if H0 is true. There are many common misinterpretations of p-values. For example, “there is a 98% chance that H0 is wrong” and “there is a 98% chance that there is a difference in risk between the exposed and unexposed groups” are both incorrect statements. One of the most difficult concepts to grasp about hypothesis testing is that a p-value has its formal meaning only in a null world. That is, the probabilistic interpretation of the 2% result in this example is correct if and only if there is in truth no difference in risk between exposed and unexposed groups. If the p-value is small, then we can conclude that the data we obtained are unlikely to have occurred in a null world (assuming the study was conducted without bias and that there was no confounding). The p-value does not speak directly to the more interesting question: How likely is it that an alternative hypothesis might be true? For example, how likely is it that exposure increases disease risk? The p-value does not tell us this.
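To make the calculation concrete, the sketch below computes a two-sided p-value for a relative risk using the usual normal approximation on the log scale. The text does not say how the p-value of 0.02 was obtained, so the standard errors used here are assumptions, chosen so that a relative risk of 2.0 gives p-values near 0.02 and, for a smaller and less precise hypothetical study, near 0.07.

```python
import math


def two_sided_p(rr, se_log_rr):
    """Two-sided p-value for H0: RR = 1.0, using a normal approximation.

    rr: observed relative risk
    se_log_rr: standard error of the natural log of the relative risk
    """
    z = math.log(rr) / se_log_rr
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Assumed standard errors (not taken from any real study).
p_larger_study = two_sided_p(2.0, se_log_rr=0.298)
p_smaller_study = two_sided_p(2.0, se_log_rr=0.383)
print(f"RR = 2.0, SE = 0.298: p = {p_larger_study:.3f}")
print(f"RR = 2.0, SE = 0.383: p = {p_smaller_study:.3f}")
```

The calculation is carried out entirely under H0: the only question it answers is how often chance alone would produce a result at least this far from RR = 1.0. The same observed relative risk gives a different p-value when only the precision of the study changes.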

The Limitations of Statistical Significance

Before moving on to discuss confidence intervals, which are an alternative way to evaluate the role of chance variation, let us examine more closely what the results in the previous example might actually mean. A study with RR= 2.0 and p= 0.02 might be the result of any of the following:

  • There is a true (positive) association between exposure and disease.

  • There is no association, but by chance, RR= 2.0 was observed.

  • There is no association, but through bias, confounding, model misspecification or some other systematic error, RR= 2.0 was observed.

How likely is the first possibility? We lack the data to make a quantitative statement about this. We can say that the second alternative is “unlikely” given the small p-value and assuming no bias or confounding. We cannot quantitatively evaluate the third possibility. Some examples of the assumptions covered by the absence of bias and confounding include: (1) the data must be a random or representative sample from the study base population; (2) the observations must be independent, one from the other; and (3) the measurements of the independent variables must be accurate.

Suppose now that the results were slightly different, so that the p-value was “not statistically significant”; p= 0.07, for example. The likely explanations for these findings are:

  • There is, in truth, no association between exposure and disease.

  • (p. 220 ) There is an association, but by chance, a “nonsignificant” p-value was obtained.

  • There is an association, but through bias, confounding, model misspecification or some other systematic error, a “nonsignificant” p-value was obtained.

The second possibility is especially likely if the study was small and so did not have much power to detect a doubling of risk comparing exposed and unexposed groups. A key point here is that a “nonsignificant” p-value does not distinguish between two very different situations: the absence of an association on the one hand and inadequate evidence of association on the other.
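The role of study size can be illustrated with an approximate power calculation. This sketch uses the same normal approximation on the log relative risk scale that underlies many routine analyses. The function name and the two standard errors are hypothetical, standing in for a small and a large study of a true doubling of risk.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approximate_power(true_rr, se_log_rr, z_crit=1.96):
    """Chance of a two-sided p < 0.05 when the true relative risk is true_rr."""
    shift = abs(math.log(true_rr)) / se_log_rr
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


# Hypothetical standard errors: a small study and a large study.
for label, se in [("small study", 0.383), ("large study", 0.200)]:
    print(f"{label} (SE = {se}): power = {approximate_power(2.0, se):.2f}")
```

With the larger assumed standard error, a true doubling of risk would reach “statistical significance” less than half the time, so a “nonsignificant” result from such a study says little about whether an association exists.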

To summarize this brief presentation of statistical significance:

  • A statistically significant p-value does not directly evaluate the likelihood that the observed association is present (true).

  • A nonstatistically significant p-value does not distinguish between the absence of an association and inadequate evidence with which to evaluate the association.

  • p-values only have their literal interpretations in the absence of bias and confounding, which occurs rarely, if ever, in observational studies.

Because of these limitations, we recommend placing minimal emphasis on p-values when interpreting epidemiologic results and avoiding the concept of statistical significance entirely. The task of an epidemiologic study is not to “prove” the existence of an effect. Rather, our interest is in assessing how supportive study results are of one or another of a limited number of alternative states of nature. That is, we are interested in asking how much the data should weigh in our evaluation of a hypothesis rather than in seeing the data as providing a formal test with which we will reject or accept a particular hypothesis.

Confidence Intervals: An Alternative to p-values

Confidence intervals provide an alternative method for quantifying the role of chance variability in study outcomes and for avoiding some but not all of the limitations of p-values. To understand what a confidence interval means, imagine that a particular study is repeated again and again, in hypothetical repetitions. In this imaginary world, the study is repeated identically, with the only difference from repetition to repetition being the chance variability in the number of outcome events in the exposed and unexposed groups. At the end of each study, the results are analyzed, and a relative risk and confidence interval are calculated. The 95% confidence interval (the only interval that is routinely calculated) around the RR can be interpreted in the following way: in hypothetical repeated samplings, the 95% confidence intervals calculated around the repeated sample risk estimates would include the true risk 95% of the time. Any given confidence interval—for (p. 221 ) example, the one that we actually do calculate in our real study—may or may not include the true value of the relative risk.
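The hypothetical repetitions can be imitated by simulation. The sketch below assumes a cohort of 500 exposed and 500 unexposed people, a true relative risk of 2.0, and a risk of 0.10 in the unexposed; all of these values are hypothetical. Each repetition draws new outcome counts by chance alone and computes a 95% confidence interval with the common log-scale (Wald) formula, which is one of several methods in use.

```python
import math
import random


def rr_with_ci(cases_exp, n_exp, cases_unexp, n_unexp, z=1.96):
    """Relative risk with a confidence interval computed on the log scale."""
    rr = (cases_exp / n_exp) / (cases_unexp / n_unexp)
    se = math.sqrt(1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)


def coverage(true_rr=2.0, risk_unexp=0.10, n=500, repetitions=1000, seed=1):
    """Fraction of repeated studies whose 95% interval contains true_rr."""
    rng = random.Random(seed)
    risk_exp = true_rr * risk_unexp
    covered = used = 0
    for _ in range(repetitions):
        # Only the outcome counts vary from one repetition to the next.
        cases_exp = sum(rng.random() < risk_exp for _ in range(n))
        cases_unexp = sum(rng.random() < risk_unexp for _ in range(n))
        if cases_exp == 0 or cases_unexp == 0:
            continue  # the log-scale interval is undefined with zero cases
        used += 1
        _, lower, upper = rr_with_ci(cases_exp, n, cases_unexp, n)
        covered += lower <= true_rr <= upper
    return covered / used


print(f"intervals containing the true RR: {coverage():.1%}")
```

The proportion should be close to 95%, but only because the simulated studies are free of bias and confounding by construction. Any single interval among the repetitions either contains the true relative risk or does not.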

The 95% “coverage,” as it is called, assumes that there is no bias and no confounding, an assumption that we have argued is generally difficult to accept. We believe that, given the many uncertainties in observational studies, a looser interpretation of confidence intervals is more appropriate. They should be viewed as representing the “supported range,” that is, the values of the relative risk that are “more likely.” If the supported range lies far from the null, then one can conclude that there is fairly strong evidence for the existence of an association or effect. Narrower confidence intervals are interpreted to mean greater certainty about where the true risk estimate may lie. Furthermore, a confidence interval that excludes the null value of the measure of association provides stronger evidence against the null than a confidence interval wide enough to include this value. It is important, however, not to overinterpret the location of the null value as either inside or outside of the confidence interval. Some authors use this distinction as a test of statistical significance (if the null value is excluded from the 95% confidence interval, then the result is said to be statistically significant), and we believe that such rigid testing is inappropriate in the context of most environmental epidemiology studies. Finally, it is important to remember that a tight confidence interval cannot make up for serious bias or uncontrolled confounding. In short: garbage in, garbage out, no matter how much garbage there is.

8.6. MANAGING UNCERTAINTY

The preceding sections of this chapter have provided an overview of the major sources of uncertainty in epidemiologic studies (Table 8.1). Approaches to evaluating and managing uncertainty vary, from formal statistical methods such as confidence interval estimation to qualitative descriptions of such potentially important topics as the choice of statistical model or the underlying assumptions about the time course of the disease process.

At the end of a long chapter on all the ways that studies can be in error, or uncertain, it is perhaps useful to remember that uncertainty is a positive and essential aspect of scientific inquiry. The uncertainty points the way to new knowledge and further investigation (Levins 1995). There is, at the same time, a strong desire on the part of scientists to be precise, which may come from confusing uncertainty of information with quality of information—two distinct concepts (Funtowicz and Ravetz 1990). One can have high-quality information about greatly uncertain phenomena.

Much additional research is needed to develop and standardize methods to characterize, express, and communicate uncertainty (Walker, Harremoës et al. 2003; Kriebel 2008). Researchers in each narrow scientific discipline develop professional judgment that they use to assess how strong a particular study finding is (p. 222 ) and how important the various uncertainties are. But the development of this professional judgment is largely intuitive and not formalized. It is therefore difficult to communicate to outsiders the full complexity of one’s assessment of the “weight of evidence” that a study’s findings contribute. There is a need, therefore, for research on the characterization and communication of uncertainty in epidemiologic research. Uncertainties that derive from the choice of research methods and mathematical models are especially in need of investigation because so little work has been done in this area.

Notes

(1.) Technically: how likely is it that the observed results or more extreme results could have occurred by chance under H0.