The Analysis of Contingency Table Data: Logistic Model I
Abstract and Keywords
This chapter discusses the use of a logistic regression model to analyze data classified into a multiway table. Topics covered include the simplest model, the 2 x 2 x 2 table, the 2 x k table, the 2 x 2 x k table, the multiway table, and goodness-of-fit.
Keywords: logistic regression model, epidemiologic data analysis, 2 x 2 x 2 table, 2 x k table
Epidemiologic data frequently consist of cases of disease or deaths tabulated by categorical variables or tabulated by continuous variables made into categories. Converting continuous variables into categorical variables simplifies the presentation and is usually based on traditional definitions (at a cost of statistical power, as already noted). For example, individuals are considered hypertensive or not depending primarily on whether their systolic blood pressure exceeds 140 mm Hg, and smokers are placed into exposure categories based on the reported numbers of cigarettes smoked per day. The end result is a table of counts. A table is certainly a useful summary of collected data, but, in addition, it is often important to explore the sometimes complex relationships found within a table. Such a table (Table 7–1) describes three risk factors and a disease outcome from the Western Collaborative Group Study (WCGS; Appendix A in this volume).
Table 7–1 contains the counts of coronary heart disease (CHD) events and the number of individuals at risk classified by behavior type, systolic blood pressure, and smoking exposure. Clearly, increased risks are associated with increasing amounts smoked, high blood pressure, and type-A behavior pattern. A number of questions, however, are not easily answered:
Does the risk of smoking have a threshold influence, or does risk increase more or less consistently as the number of cigarettes smoked increases?
What is the influence of blood pressure and smoking on the behavior/CHD relationship?
Do these three risk factors have joint or separate influences on the probability of a CHD event?
What is the relative influence of each risk factor on the likelihood of a coronary event?
Frequently, the inability to answer complex or even elementary questions about risk/disease relationships in a satisfactory way results, to a large extent, from lack of data. If hundreds of thousands of individuals were classified into a table with numerous categories, most questions about the role of the risk factors in the occurrence of a disease could be directly answered. When more than a few variables are investigated or large amounts of data are not available, a table often fails to effectively describe the relationships among the categorical variables. Additionally, human disease is, almost always, a relatively rare event, and it causes low cell frequencies in most tabulated data. For example, only three coronary events occur in the low blood pressure, heavy smoking, type-B category among 34 observations when a relatively large sample of 3154 men was collected. Therefore, it becomes necessary for analytic procedures to account for the impact of sampling variation. A statistical analysis based on a logistic regression model is one way to describe simply and efficiently the relationships between risk factors and a binary disease outcome classified into a multiway table.
The use of a logistic regression model to analyze data classified into a multiway table is the topic of this chapter. The extension to include continuous risk variables is the topic of the next chapter. Logistic regression is by no means the only approach to the analysis of risk/disease relationships, but it is frequently used and demonstrates both the strengths and weaknesses of multivariable techniques applied to epidemiologic data.
THE SIMPLEST MODEL: DISCRETE CASE
A 2 × 2 table provides the simplest illustration of a logistic model applied to the relationships among contingency table variables. To start, the variables to be studied are the presence (D = 1) and absence (D = 0) of a disease investigated at two levels of a risk factor (F = 1, risk factor present; F = 0, risk factor absent). The general notation (repeated) for a contingency table applies (Table 7–2 ).
A chi-square analysis is the most common approach for assessing an association within a 2 × 2 table. Other methods based on conditional probabilities such as $P(D \mid F)$ and $P(D \mid \bar{F})$ are also used to describe the risk/disease relationship.
Table 7–1. Coronary heart disease classified by behavior type, systolic blood pressure, and smoking exposure: a summary table of probabilities of CHD (CHD events/number at risk, by cigarettes smoked per day)

| Blood pressure (mm Hg) | Behavior type | 0 cigarettes/day | 1–20 cigarettes/day | 21–30 cigarettes/day | > 30 cigarettes/day |
|---|---|---|---|---|---|
| ≥ 140 | A | 29/184 = 0.158 | 21/97 = 0.216 | 7/52 = 0.135 | 12/55 = 0.218 |
| ≥ 140 | B | 8/179 = 0.045 | 9/71 = 0.127 | 3/34 = 0.088 | 7/21 = 0.333 |
| < 140 | A | 41/600 = 0.068 | 24/301 = 0.080 | 27/167 = 0.162 | 17/133 = 0.128 |
| < 140 | B | 20/689 = 0.029 | 16/336 = 0.048 | 13/152 = 0.086 | 3/83 = 0.036 |
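Table 7–2. Notation for a 2 × 2 contingency table

| Risk factor | D = 1 (disease present) | D = 0 (disease not present) | Total |
|---|---|---|---|
| F = 1 (factor present) | $n_{11}$ | $n_{12}$ | $n_{1.}$ |
| F = 0 (factor absent) | $n_{21}$ | $n_{22}$ | $n_{2.}$ |
| Total | $n_{.1}$ | $n_{.2}$ | $n$ |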
A baseline measure of disease risk is the estimated log-odds when the risk factor is absent, or

$$\hat{a} = \log\left(\frac{n_{21}}{n_{22}}\right).$$

The two quantities (a, b) form the simplest possible linear logistic model. That is, log-odds = a + bF, where F = 0 or F = 1, so that when F = 0, a is estimated by log-odds = $\log(n_{21}/n_{22}) = \hat{a}$ and when F = 1, a + b is estimated by log-odds = $\log(n_{11}/n_{12}) = \hat{a} + \hat{b}$. The difference between the two estimated log-odds is therefore $\hat{b} = \log(n_{11}/n_{12}) - \log(n_{21}/n_{22})$.
The value $\hat{b}$ estimates, on the log-odds scale, the change in risk of disease associated with the presence of the risk factor (F = 1) relative to the absence of the same factor (F = 0). The risk factor in the context of a linear model has other names: it is sometimes called the predictor variable, the explanatory variable, or the independent variable. A difference in log-odds is not a particularly intuitive way to assess a risk factor/disease association, but it has tractable mathematical properties and, of most importance, relates directly to the odds ratio measure of association, which is the primary reason for using a logistic regression model. In symbols, $\hat{b} = \log(\widehat{or})$.
Consider a 2 × 2 table (Table 7–3) from the WCGS data describing the relationship between behavior type (risk factor) and a coronary event (disease outcome). From these data, $\hat{a} = \log(79/1486) = -2.934$, and $\hat{a} + \hat{b} = \log(178/1411) = -2.070$. The estimated change in the log-odds of CHD associated with type-A individuals (risk factor present) relative to type-B individuals (risk factor absent, the baseline) is $\hat{b} = -2.070 - (-2.934) = 0.864$ (Fig. 7–1). Since $\hat{b} = \log(\widehat{or})$, then $\widehat{or} = e^{\hat{b}}$ and is the usual estimate of the odds ratio calculated from a 2 × 2 table (Appendix C). For the behavior type data (Table 7–3), the odds ratio estimate is $\widehat{or} = e^{0.864} = 2.373$. The odds of a coronary event are, therefore, 2.373 times greater for a type-A individual than for a type-B individual. This same odds ratio can also be calculated directly from the tabled data where $\widehat{or} = (178/1411)/(79/1486) = 2.373$.
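The arithmetic in this example is easy to check by machine. The following is a minimal sketch in Python; the function name `log_odds_summary` and its argument order are illustrative choices and not notation from this chapter. It computes the baseline log-odds, the change in log-odds, and the odds ratio from the four counts of Table 7–3.

```python
import math

def log_odds_summary(n11, n12, n21, n22):
    """Return (a_hat, b_hat, odds_ratio) for a 2 x 2 table.

    Rows: F = 1 (risk factor present), F = 0 (risk factor absent).
    Columns: D = 1 (disease), D = 0 (no disease).
    """
    a_hat = math.log(n21 / n22)            # baseline log-odds (F = 0)
    b_hat = math.log(n11 / n12) - a_hat    # change in log-odds when F = 1
    odds_ratio = math.exp(b_hat)           # same as (n11 * n22) / (n12 * n21)
    return a_hat, b_hat, odds_ratio

# Table 7-3: type-A (178 CHD, 1411 no CHD), type-B (79 CHD, 1486 no CHD)
a_hat, b_hat, odds_ratio = log_odds_summary(178, 1411, 79, 1486)
print(round(a_hat, 3), round(b_hat, 3), round(odds_ratio, 3))
```

Because the model is saturated, the exponentiated coefficient and the cross-product ratio of the table are the same number, which the sketch makes explicit.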
When the number of model parameters estimated equals the number of observed log-odds values calculated from a table, the results of using a model to describe associations and making direct calculations from the data are always identical. Such a model is called saturated. The two-parameter logistic model is saturated when it is applied to a 2 × 2 table, because the table produces exactly two observed log-odds values.
Table 7–3. CHD by A/B behavior type

| Behavior type | CHD | No CHD | Total |
|---|---|---|---|
| A | 178 | 1411 | 1589 |
| B | 79 | 1486 | 1565 |
| Total | 257 | 2897 | 3154 |
The two primary ways to evaluate the influence of this sampling variation on the estimated log-odds are a significance test and a confidence interval.
Aside: The relative merits of a confidence interval versus a significance test have been occasionally discussed, even debated ([ 1 ], [ 2 ], and [ 3 ]). The confidence interval and the two-sided significance test rarely give contradictory results. The confidence interval conveys more information in the sense that the interval width gives some idea of the range of parameter possibilities (power). A confidence interval is expressed in the same units as the estimated quantity, which is sometimes helpful. A significance test reported in isolation tends to reduce conclusions to two choices (“significant” or “not significant”), which, perhaps, emphasizes the role of chance variation too simply. For example, estimates can be “significant” (chance not a plausible explanation) because a large number of observations are involved, but the results do not reflect an important biological or physical measurement. However, a confidence interval approach is complicated when more than one estimate is involved. In these situations several estimates can be assessed simultaneously with a significance test where the analogous confidence region is difficult to construct and complex to interpret. Of course, a significance test and a confidence interval can both be presented. However, if one approach must be chosen, a confidence interval usually allows a direct and more comprehensive description of the impact of sampling variation on a single estimated quantity.
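Lack of precision, incurred because the collected data represent a fraction of the population sampled, must be taken into account to realistically assess the influence of a risk factor on the likelihood of a disease. To evaluate the impact of this sampling variation associated with the estimate $\hat{b}$, both a significance test and a confidence interval require an estimate of the variance of the distribution of the estimate $\hat{b}$, which is

$$\text{variance}(\hat{b}) = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}.$$

For the behavior-type data (Table 7–3), the estimated variance is 1/178 + 1/1411 + 1/79 + 1/1486 = 0.020.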
For the WCGS data, Wald's chi-square test statistic is $X^2 = (0.864)^2/0.020 = 37.985$ (calculated from unrounded values), yielding a p-value of $P(X^2 \ge 37.985 \mid b = 0) < 0.001$ and providing substantial evidence that A/B behavior type is associated with CHD incidence. An approximate 100(1 − α)% confidence interval based on the estimate $\hat{b}$ has bounds lower = $\hat{b} - z_{1-\alpha/2} S_{\hat{b}}$ and upper = $\hat{b} + z_{1-\alpha/2} S_{\hat{b}}$, where $S_{\hat{b}}$ is the square root of the estimated variance, because $\hat{b}$ has an approximate normal distribution for moderate or large samples of observations. The value $z_{1-\alpha/2}$ is the 100(1 − α/2)th percentile of a standard normal distribution. A 100(1 − α)% confidence interval for the odds ratio $or = e^{b}$ is then $(e^{\text{lower}}, e^{\text{upper}})$. The approximate 95% confidence interval based on the estimate $\hat{b} = 0.864$ from the behavior-type data is (0.589, 1.139). The approximate 95% confidence interval for the odds ratio (or) is then ($e^{0.589} = 1.803$, $e^{1.139} = 3.123$), indicating a likely range of the possible values for the underlying “true” odds ratio, estimated by $\widehat{or} = e^{0.864} = 2.373$.
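The Wald test and the confidence interval can be reproduced from the four cell counts alone. The sketch below is a hedged illustration, not a prescribed procedure: it assumes the usual variance estimate for a log odds ratio (the sum of the reciprocals of the four cell counts), uses z = 1.96 for a 95% interval, and obtains the one-degree-of-freedom chi-square tail probability from the standard library's complementary error function. The variable names are illustrative.

```python
import math

# Table 7-3 counts: type-A (CHD, no CHD) and type-B (CHD, no CHD)
n11, n12, n21, n22 = 178, 1411, 79, 1486

b_hat = math.log((n11 * n22) / (n12 * n21))          # estimated log odds ratio
variance = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22     # assumed variance estimate
se = math.sqrt(variance)

# Wald chi-square statistic with one degree of freedom
chi_square = b_hat ** 2 / variance
# P(X^2 >= x) for one degree of freedom equals P(|Z| >= sqrt(x))
p_value = math.erfc(math.sqrt(chi_square / 2))

z = 1.96                                             # 97.5th normal percentile
lower, upper = b_hat - z * se, b_hat + z * se
or_lower, or_upper = math.exp(lower), math.exp(upper)

print(round(chi_square, 3), p_value < 0.001)
print(round(lower, 3), round(upper, 3), round(or_lower, 3), round(or_upper, 3))
```

The interval is built on the log-odds scale, where the normal approximation is applied, and only then exponentiated, which is why the interval for the odds ratio is not symmetric about 2.373.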
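Using a linear model to represent the log-odds is related to describing the probabilities from a 2 × 2 table with a logistic function; hence the name “logistic regression.” The logistic function, which has historically been used in a variety of contexts to study biological phenomena, is formally

$$f(x) = \frac{1}{1 + e^{-x}}.$$

In symbols, the logistic probabilities applied to a 2 × 2 table are

$$P(D \mid F = 0) = \frac{1}{1 + e^{-a}} \quad \text{and} \quad P(D \mid F = 1) = \frac{1}{1 + e^{-(a + b)}}.$$

Logistic probabilities from a 2 × 2 table imply a linear relationship among the log-odds, and vice versa. For example, the log-odds

$$\log\left[\frac{P(D \mid F = 1)}{1 - P(D \mid F = 1)}\right] = a + b$$

measures risk when the risk factor is present and

$$\log\left[\frac{P(D \mid F = 0)}{1 - P(D \mid F = 0)}\right] = a$$

measures risk when the risk factor is absent.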
An odds ratio is most easily interpreted when the frequency of the disease is low (less than 0.1 in both risk-factor groups, $F$ and $\bar{F}$) since, in this case, the estimated odds ratio is approximately equal to a simpler measure of association between risk factor and disease called the relative risk. In symbols, $\widehat{or} \approx \widehat{rr} = \hat{P}(D \mid F)/\hat{P}(D \mid \bar{F})$. Relative risk is a natural measure of risk to compare the probability of disease between one group that possesses the risk factor ($F$) to another that does not possess the risk factor ($\bar{F}$) and is discussed in detail elsewhere (for example, [ 4 ] and [ 5 ]). Nevertheless, the odds ratio measures association regardless of the frequency of the outcome variable.
A summary of the logistic regression model applied to estimate the association between behavior type and coronary heart disease using the WCGS data is given in Table 7–4. The logistic probabilities are estimated as $\hat{P}(D \mid F = 1) = 0.112$ and $\hat{P}(D \mid F = 0) = 0.050$, whether calculated from the logistic function using estimates $\hat{a}$ and $\hat{b}$ or directly from the data, again indicating that the logistic model is saturated when it is applied to a 2 × 2 table. Specifically, for the type-A individuals (F = 1)

$$\hat{P}(D \mid F = 1) = \frac{1}{1 + e^{-(\hat{a} + \hat{b})}} = \frac{1}{1 + e^{2.070}} = 0.112 = \frac{178}{1589}.$$
Table 7–4. CHD by A/B: the two-parameter logistic model

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −2.934 | 0.115 | | |
| A/B | $\hat{b}$ | 0.864 | 0.140 | < 0.001 | 2.373 |

−2 log-likelihood = 1740.344; number of model parameters = 2.
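The entries of this summary can be recovered from the counts in Table 7–3. The sketch below is an illustration under stated assumptions: the helper names `logistic` and `neg2_log_likelihood` are hypothetical, and the likelihood is taken to be the product of individual (Bernoulli) outcome probabilities, with no binomial coefficients. It shows that the logistic function applied to the estimated coefficients returns the observed proportions and then evaluates the −2 log-likelihood of the fitted model.

```python
import math

def logistic(x):
    """Logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def neg2_log_likelihood(groups, probabilities):
    """-2 log-likelihood for grouped binary data (events, number at risk)."""
    total = 0.0
    for (events, at_risk), p in zip(groups, probabilities):
        total += events * math.log(p) + (at_risk - events) * math.log(1.0 - p)
    return -2.0 * total

# Table 7-3: (CHD events, number at risk)
type_a = (178, 1589)   # F = 1
type_b = (79, 1565)    # F = 0

a_hat = math.log(79 / 1486)             # baseline log-odds
b_hat = math.log(178 / 1411) - a_hat    # change in log-odds for type-A

p_type_a = logistic(a_hat + b_hat)      # model probability, F = 1
p_type_b = logistic(a_hat)              # model probability, F = 0

neg2_loglik = neg2_log_likelihood([type_a, type_b], [p_type_a, p_type_b])
print(round(p_type_a, 3), round(p_type_b, 3), round(neg2_loglik, 3))
```

The model probabilities agree with the observed proportions 178/1589 and 79/1565 to machine precision, which is what saturation means in practice.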
THE 2 × 2 × 2 TABLE
When an association is detected in a 2 × 2 table, a natural question arises: How is this association influenced by other variables? The WCGS data show a strong association between behavior type and coronary disease, but the question remains: Could this association be, at least in part, a result of the influences of other risk factors? For example, systolic blood pressure level is related to both the risk of coronary disease and behavior type and could substantially influence the observed association. It is entirely possible that type-A individuals have on average higher blood pressure levels than type-B individuals, and it is this increased level of blood pressure that produces the higher risk of CHD and not the difference in behavior types. A 2 × 2 × 2 contingency table sheds some light on the influence of a third variable on the relationship under investigation (Chapter 6). To illustrate, the role of systolic blood pressure as a possible influence on the relationship between behavior type and coronary disease is explored from a 2 × 2 × 2 table of WCGS data (Table 7–5). Another description of these data is possible in terms of the log-odds (Table 7–6).
The influence of behavior type is measured by the difference in log-odds, or $\hat{b}_1 = \hat{l}_{11} - \hat{l}_{12} = -1.531 - (-2.332) = 0.801$ for individuals with blood pressure of 140 mm Hg or more and $\hat{b}_2 = \hat{l}_{21} - \hat{l}_{22} = -2.304 - (-3.145) = 0.841$ for individuals with blood pressure below 140 mm Hg (Table 7–6). In the same way, the influence of blood pressure is measured by $\hat{c}_1 = \hat{l}_{11} - \hat{l}_{21} = 0.773$ for type-A individuals and $\hat{c}_2 = \hat{l}_{12} - \hat{l}_{22} = 0.813$ for type-B individuals.
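Table 7–5. CHD by A/B and systolic blood pressure

| Behavior type | CHD (BP ≥ 140) | No CHD (BP ≥ 140) | Total (BP ≥ 140) | CHD (BP < 140) | No CHD (BP < 140) | Total (BP < 140) |
|---|---|---|---|---|---|---|
| A | 69 | 319 | 388 | 109 | 1092 | 1201 |
| B | 27 | 278 | 305 | 52 | 1208 | 1260 |
| Total | 96 | 597 | 693 | 161 | 2300 | 2461 |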
Table 7–6. CHD by A/B and systolic blood pressure: a summary

| Blood pressure | Behavior type | Log-odds | Notation | At risk | CHD | No CHD |
|---|---|---|---|---|---|---|
| ≥ 140 | A | −1.531 | $l_{11}$ | 388 | 69 | 319 |
| ≥ 140 | B | −2.332 | $l_{12}$ | 305 | 27 | 278 |
| < 140 | A | −2.304 | $l_{21}$ | 1201 | 109 | 1092 |
| < 140 | B | −3.145 | $l_{22}$ | 1260 | 52 | 1208 |
The failure of these two measures of association (the $\hat{b}$'s or the $\hat{c}$'s) to be equal in each subtable brings up a fundamental issue. One of two possible situations exists:

1. The influence of the risk factor under study is the same in both subtables and the observed differences arise only because of random variation (interaction absent).

2. The influence of the risk factor under study is different in each subtable (interaction present).
The first possibility suggests that the two estimates should be combined to produce a single summary measure of the association between risk factor and disease. The second suggests quite the opposite. As noted before, if a variable behaves differently in each subtable, then a combined estimate possibly conceals or exaggerates the difference and will occasionally be completely misleading. As always, the presence of interactions restricts summarization.
In symbolic terms, for a 2 × 2 × 2 table, an interaction measured by the log-odds is interaction = $b_1 - b_2 = (l_{11} - l_{12}) - (l_{21} - l_{22}) = l_{11} - l_{12} - l_{21} + l_{22}$ or, similarly, interaction = $c_1 - c_2 = (l_{11} - l_{21}) - (l_{12} - l_{22}) = l_{11} - l_{12} - l_{21} + l_{22}$. The expression $l_{11} - l_{12} - l_{21} + l_{22}$ is the difference between two differences and quantifies the amount of interaction associated with two risk factors on a log-odds scale. An estimate of the magnitude of the interaction associated with behavior type and blood pressure is $\hat{d} = -1.531 - (-2.332) - (-2.304) + (-3.145) = -0.040$.
A key issue is whether the data provide evidence of a nonrandom (“real”) interaction. Testing the null hypothesis that $l_{11} - l_{12} - l_{21} + l_{22} = 0$ is equivalent to testing the hypothesis that $b_1 = b_2 = b$ or $c_1 = c_2 = c$ (are the differences different?). A statistical test to evaluate the magnitude of the estimated interaction effect requires an expression for the variance. The variance associated with the distribution of the estimated interaction effect in a 2 × 2 × 2 table is estimated by the sum of the reciprocals of the eight cell frequencies, or

$$\text{variance}(\hat{d}) = \sum \frac{1}{n_{ijk}},$$

where $n_{ijk}$ denotes one of the eight counts in the table. For the WCGS data (Table 7–5), the estimated variance is 1/69 + 1/319 + 1/27 + 1/278 + 1/109 + 1/1092 + 1/52 + 1/1208 = 0.088.
The linear model underlying the log-odds (denoted $l_{ij}$) approach to describing relationships in a 2 × 2 × 2 table is

$$l_{11} = a + b_2 + c_2 + d, \quad l_{12} = a + c_2, \quad l_{21} = a + b_2, \quad \text{and} \quad l_{22} = a.$$

Another view of the parameter d is $d = b_1 - b_2 = c_1 - c_2$, or $b_1 = b_2 + d$ and $c_1 = c_2 + d$.
Table 7–7 summarizes the estimates of these four components (a, b _{2}, c _{2}, and d) that describe the relationship of the risk variables, blood pressure, and behavior type to coronary disease in terms of the logistic model.
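Because the four-parameter model is saturated, its estimates follow directly from the four observed log-odds. The sketch below is a hedged illustration: the dictionary keys (`"high"`, `"low"`, `"A"`, `"B"`) and variable names are illustrative, and the variance of the interaction estimate is assumed to be the sum of the reciprocals of the eight cell counts. It computes the four components, the standard error of the interaction term, and the corresponding one-degree-of-freedom Wald test.

```python
import math

# Table 7-5 counts: (CHD, no CHD) for each blood pressure / behavior type cell
cells = {
    ("high", "A"): (69, 319),     # l_11: blood pressure >= 140, type-A
    ("high", "B"): (27, 278),     # l_12: blood pressure >= 140, type-B
    ("low", "A"): (109, 1092),    # l_21: blood pressure < 140, type-A
    ("low", "B"): (52, 1208),     # l_22: blood pressure < 140, type-B
}

log_odds = {key: math.log(chd / no_chd) for key, (chd, no_chd) in cells.items()}
l11, l12 = log_odds[("high", "A")], log_odds[("high", "B")]
l21, l22 = log_odds[("low", "A")], log_odds[("low", "B")]

a_hat = l22                        # baseline: type-B, blood pressure < 140
b2_hat = l21 - l22                 # behavior type effect at blood pressure < 140
c2_hat = l12 - l22                 # blood pressure effect among type-B
d_hat = l11 - l12 - l21 + l22      # interaction: a difference of differences

# Assumed variance estimate: sum of reciprocals of all eight cell counts
variance_d = sum(1 / count for pair in cells.values() for count in pair)
se_d = math.sqrt(variance_d)
chi_square = d_hat ** 2 / variance_d
p_value = math.erfc(math.sqrt(chi_square / 2))   # chi-square tail, 1 df

print(round(a_hat, 3), round(b2_hat, 3), round(c2_hat, 3), round(d_hat, 3))
print(round(se_d, 3), round(p_value, 3))
```

Small differences in the third decimal place relative to a published table can arise when differences are taken between already rounded log-odds values.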
The application of the logistic model to a 2 × 2 × 2 table can be geometrically interpreted as two straight lines on a log-odds scale, each depicting the influence of the risk factor at each level of the third variable. The parameter d indicates the degree to which these two lines are not parallel. Figure 7–2 shows two lines representing the change in CHD risk associated with behavior type for each level of blood pressure estimated from the WCGS data. The model-estimated odds ratios (last column in Table 7–7) could have been calculated directly from the tabled data, indicating that the four-parameter model is saturated. For example, $\hat{b}_2 = 0.841$, or $\widehat{or} = e^{0.841} = 2.319$, and is identical to $\widehat{or} = (109/1092)/(52/1208) = 2.319$. Four log-odds values means a four-parameter logistic model is saturated.
When no interaction exists (d = 0; that is, $or_1 = or_2$), an additive logistic model (not saturated) represents the relationships in a 2 × 2 × 2 table, again in terms of the log-odds:

$$l_{11} = a + b + c, \quad l_{12} = a + c, \quad l_{21} = a + b, \quad \text{and} \quad l_{22} = a.$$
Table 7–7. CHD by A/B by systolic blood pressure: the four-parameter logistic model

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −3.145 | 0.142 | | |
| A/B | $\hat{b}_2$ | 0.841 | 0.174 | < 0.001 | 2.319 |
| Blood pressure | $\hat{c}_2$ | 0.813 | 0.246 | < 0.001 | 2.256 |
| Interaction | $\hat{d}$ | −0.040 | 0.297 | 0.892 | 0.960 |

−2 log-likelihood = 1709.934; number of model parameters = 4.
This statistical model is additive because the log-odds associated with possessing both risk factors ($l_{11}$) relative to possessing neither risk factor ($l_{22}$) is exactly the sum of the log-odds influences from each risk factor, namely b and c. In symbols, $l_{11} - l_{22} = a + b + c - a = b + c$. A direct result of an additive model is that d = 0, or $d = l_{11} - l_{12} - l_{21} + l_{22} = (a + b + c) - (a + c) - (a + b) + a = 0$. The values b and c take the place of the previous values $b_1$, $b_2$, $c_1$, and $c_2$. As required, $b_1 = b_2 = b$ and $c_1 = c_2 = c$ because d = 0. An iterative procedure or a computer program is necessary to estimate the model parameter values of a, b, and c from a 2 × 2 × 2 table.
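One such iterative procedure is Newton–Raphson maximization of the likelihood. The following is a minimal sketch under stated assumptions, not the algorithm of any particular statistical package: the function names `solve` and `fit_logistic` are hypothetical, the iteration starts from zero and runs a fixed number of steps, and the indicator coding (F = 1 for type-A, C = 1 for blood pressure of 140 or more) is an assumed convention. It fits the additive model to the counts of Table 7–5 using only the standard library.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_logistic(rows, n_iter=25):
    """Newton-Raphson fit; rows are (design vector, events, number at risk)."""
    k = len(rows[0][0])
    beta = [0.0] * k
    for _ in range(n_iter):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, events, at_risk in rows:
            p = 1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
            for i in range(k):
                score[i] += x[i] * (events - at_risk * p)
                for j in range(k):
                    info[i][j] += x[i] * x[j] * at_risk * p * (1.0 - p)
        beta = [b + step for b, step in zip(beta, solve(info, score))]
    # -2 log-likelihood at the final estimates
    neg2_loglik = 0.0
    for x, events, at_risk in rows:
        p = 1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
        neg2_loglik -= 2.0 * (events * math.log(p)
                              + (at_risk - events) * math.log(1.0 - p))
    return beta, neg2_loglik

# Design vector [1, F, C]; counts (CHD events, number at risk) from Table 7-5
rows = [
    ([1, 1, 1], 69, 388),     # type-A, blood pressure >= 140
    ([1, 0, 1], 27, 305),     # type-B, blood pressure >= 140
    ([1, 1, 0], 109, 1201),   # type-A, blood pressure < 140
    ([1, 0, 0], 52, 1260),    # type-B, blood pressure < 140
]
beta, neg2_loglik = fit_logistic(rows)
print([round(b, 3) for b in beta], round(neg2_loglik, 3))
```

In practice a statistical package would be used instead; the sketch only shows that nothing more than repeated solution of a small linear system is involved.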
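The maximum likelihood estimates (Appendix E) from the WCGS data are $\hat{a} = -3.136$, $\hat{b} = 0.827$, and $\hat{c} = 0.786$. That is, the estimated model is log-odds = −3.136 + 0.827F + 0.786C. From the additive model and the estimated coefficients (Table 7–8), the model-generated log-odds and probabilities can be estimated (Table 7–9). For example, $\hat{l}_{11} = \hat{a} + \hat{b} + \hat{c} = -3.136 + 0.827 + 0.786 = -1.523$, giving

$$\hat{P}(\text{CHD} \mid \text{type A, blood pressure} \ge 140) = \frac{1}{1 + e^{1.523}} = 0.179$$

and an expected count of 69.46 CHD events among the 388 men at risk in this category (Table 7–9).
The additive (no-interaction) model requires that the influence of the risk factor be exactly the same in both subtables, or $b_1 = b_2 = b$ and $c_1 = c_2 = c$. The estimated odds ratio associated with behavior type is $\widehat{or} = e^{\hat{b}} = e^{0.827} = 2.286$, which contrasts the CHD risk associated with behavior type adjusted for the influence of blood pressure. A type-A individual has odds of a coronary event 2.286 times greater than a type-B individual at either level of blood pressure.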
Table 7–8. CHD by A/B by systolic blood pressure: the three-parameter additive logistic model

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −3.136 | 0.124 | | |
| A/B | $\hat{b}$ | 0.827 | 0.141 | < 0.001 | 2.286 |
| Blood pressure | $\hat{c}$ | 0.786 | 0.138 | < 0.001 | 2.195 |

−2 log-likelihood = 1709.954; number of model parameters = 3.
Table 7–9. CHD by A/B and systolic blood pressure: expected values based on the additive model parameter estimates (Table 7–8)

| Blood pressure | Behavior type | Log-odds | Notation | At risk | CHD | No CHD |
|---|---|---|---|---|---|---|
| ≥ 140 | A | −1.523 | $l_{11}$ | 388 | 69.46 | 318.54 |
| ≥ 140 | B | −2.350 | $l_{12}$ | 305 | 26.54 | 278.46 |
| < 140 | A | −2.309 | $l_{21}$ | 1201 | 108.54 | 1092.46 |
| < 140 | B | −3.136 | $l_{22}$ | 1260 | 52.46 | 1207.54 |
In the absence of an interaction, the factors influencing the outcome act additively on the log-odds scale and multiplicatively on the odds ratio scale. No interaction, therefore, requires each factor to contribute to the risk of disease separately, and the overall odds ratio becomes a product of a series of specific odds ratios, each associated with a single risk factor (in symbols, $or = \prod or_i$). Under these additive conditions, the risk factors are said to have an “independent” influence on the risk of disease.
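As a worked illustration of this product rule, using the additive-model estimates $\hat{b} = 0.827$ (behavior type) and $\hat{c} = 0.786$ (blood pressure) from Table 7–8, the estimated odds ratio comparing type-A individuals with blood pressure of 140 or more to type-B individuals with blood pressure below 140 is

$$\widehat{or} = e^{\hat{b} + \hat{c}} = e^{0.827 + 0.786} = e^{0.827} \times e^{0.786} = 2.286 \times 2.195 = 5.018.$$

Adding the coefficients on the log-odds scale and multiplying the separate odds ratios are two descriptions of the same calculation.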
A comparison of the model-generated values from the additive (no-interaction) model in Table 7–9 with the corresponding values in Table 7–6 (observed values) shows no important differences occur when the interaction term is deleted from the model (interaction present versus interaction absent). The influence of an interaction is formally evaluated by contrasting the log-likelihood statistics (Appendix E) associated with each model. The logarithm of the likelihood value multiplied by −2 is called the log-likelihood statistic. A log-likelihood statistic is a relative measure of the goodness-of-fit of a specific model and the difference between two log-likelihood statistics produces a test statistic with an approximate chi-square distribution. A log-likelihood statistic increases with each parameter deleted from a model. This increase is likely to be small when the deleted parameter is unrelated to the disease under study and substantial if the deleted parameter influences the risk of disease. The difference between two log-likelihood statistics, therefore, reflects the comparative strengths of two models to “explain” the data.
For the WCGS data, the two log-likelihood statistics are $L_{d=0} = 1709.954$ from the no-interaction model (the parameter d deleted from the model) and $L_{d \ne 0} = 1709.934$ from the saturated model (the parameter d maintained in the model), showing that the descriptive power of the logistic model is essentially unchanged by assuming that no interaction exists (d = 0). The difference between two log-likelihood statistics has an approximate chi-square distribution when the observed increase arises from strictly capitalizing on random variation. The degrees of freedom are equal to the difference in the number of parameters necessary to describe each model. Using this log-likelihood chi-square test verifies that setting d = 0 in the logistic model has essentially no impact. Formally, the chi-square statistic is $X^2 = L_{d=0} - L_{d \ne 0} = 1709.954 - 1709.934 = 0.020$ with one degree of freedom and produces a p-value of $P(X^2 \ge 0.020 \mid d = 0) = 0.888$. Contrasting log-likelihood statistics to evaluate the descriptive worth of two competing models is a fundamental statistical tool and is used repeatedly in the following discussion.
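The comparison of the two log-likelihood statistics amounts to one subtraction and one tail probability. The sketch below is a minimal illustration; the function name `chi_square_tail_1df` is hypothetical, and it relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper tail can be computed with the complementary error function.

```python
import math

def chi_square_tail_1df(x):
    """P(X^2 >= x) for a chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

neg2_loglik_additive = 1709.954    # interaction parameter d deleted
neg2_loglik_saturated = 1709.934   # interaction parameter d included

lr_statistic = neg2_loglik_additive - neg2_loglik_saturated
p_value = chi_square_tail_1df(lr_statistic)
print(round(lr_statistic, 3), round(p_value, 3))
```

The same function applies to any comparison of two nested models that differ by a single parameter.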
To evaluate the role of systolic blood pressure, the logistic model with blood pressure deleted from the model is the key to contrasting log-likelihood statistics. The additive model with the parameter c set to zero is summarized in Table 7–4. The difference in the log-likelihood statistics for the model including blood pressure and the model excluding blood pressure indicates the contribution of the blood pressure variable to the additive model describing the risk of coronary disease. Thus, $L_{c=0} = 1740.344$ (blood pressure deleted) and $L_{c \ne 0} = 1709.954$ (blood pressure included) gives $X^2 = L_{c=0} - L_{c \ne 0} = 30.390$. The test statistic $X^2$ has a chi-square distribution with one degree of freedom when c = 0. The associated p-value is $P(X^2 \ge 30.390 \mid c = 0) < 0.001$, showing that systolic blood pressure is a vital part of the description of CHD risk. The comparison is not influenced by behavior type because this risk variable is maintained in both additive models.
A log-likelihood approach to evaluating a single variable is not basically different from the statistical test of the null hypothesis stating that a single parameter is zero. As indicated earlier, a squared estimated coefficient divided by its variance has an approximate chi-square distribution with one degree of freedom when the expected value of the coefficient is zero (Wald's test). Specifically, the blood pressure test statistic is $X^2 = \hat{c}^2/\text{variance}(\hat{c}) = (0.786/0.138)^2 = 32.4$ with a p-value < 0.001. The usefulness of employing log-likelihood statistics to evaluate hypotheses stems from the fact that combinations of several risk factors can be assessed simultaneously by comparing a model containing these risk factors to a model excluding these factors, resulting in a rigorous statistical assessment.
We can add here a note on the power to detect interaction effects. The power of a statistical test to detect an interaction can be relatively low, compared to tests of direct effects. In terms of the WCGS data, the probability of identifying an interaction between blood pressure and behavior type is less than the probability that such variables as blood pressure and behavior type will be identified as significant risk factors.
In a 2 × 2 × 2 contingency table, the reason for a relatively less powerful test is directly seen. Recall that the variance of the estimated measure of interaction is the sum of the reciprocals of all eight cell frequencies in the table. Each of these eight cells necessarily contains fewer observations than the corresponding cell of the collapsed 2 × 2 table, so the variance of the interaction estimate (0.088 for the WCGS data) is larger than the variance of a direct effect estimated from four cells (0.020 for behavior type).
It is essential to detect interaction effects when they exist. It is not as critical to eliminate interaction terms when the data can support an additive model. In terms of assessing the null hypothesis of no interaction ($H_0$), a type I error (rejecting $H_0$ when $H_0$ is true) is not as important as a type II error (accepting $H_0$ when $H_0$ is not true) when it comes to testing for an interaction. For a test of the impact of an interaction, it is a good idea to increase the level of significance (say, α = 0.2) to increase the power (Chapter 3). Increasing the type I error α to attain more statistical power (decreasing the type II error) is a conservative strategy in the sense that relatively minor losses occur, such as some loss of efficiency and increased complexity of the model, if interaction terms are unnecessarily included in the model (type I error). Mistaken elimination of interaction effects (type II error), however, substantially disrupts the validity of any conclusions (“wrong model bias”). To repeat one more time, ignoring an interaction leads to potentially misleading results.
THE 2 × k TABLE
A 2 × k table is effectively analyzed using a log-odds measure of risk. Underlying the log-odds approach is the assumption that the levels of the risk factor act as a series of multiplicative effects. The previous approach to a 2 × k table used a straight line to summarize a series of proportions (Chapter 6), implying a linear relationship between the probability of disease and the levels of the risk factor. The WCGS data (Table 7–10) provide an example of a 2 × 4 table relating coronary heart disease to the amount smoked (k = 4 levels).
A logistic model that exactly duplicates (i.e., is a saturated model) the information contained in a 2 × k contingency table in terms of log-odds (denoted $l_i$; $i = 0, 1, 2, 3$) is, for k = 4, $l_0 = a$, $l_1 = a + b_1$, $l_2 = a + b_2$, and $l_3 = a + b_3$, or log-odds = $a + b_1 x_1 + b_2 x_2 + b_3 x_3$, where $x_1$, $x_2$, and $x_3$ are three binary variables (0, 1) that identify each level of the categorical risk factor. For example, the values $x_1 = 0$, $x_2 = 1$, and $x_3 = 0$ produce a model value log-odds = $l_2 = a + b_2$, which represents the risk from smoking 21 to 30 cigarettes per day. The values $x_1 = x_2 = x_3 = 0$ establish a baseline or referent group (log-odds = $l_0 = a$) that reflects the CHD risk among nonsmokers.
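Table 7–10. CHD by smoking: a 2 × k table

| | 0 cigarettes/day | 1–20 cigarettes/day | 21–30 cigarettes/day | > 30 cigarettes/day | Total |
|---|---|---|---|---|---|
| CHD | 98 | 70 | 50 | 39 | 257 |
| No CHD | 1554 | 735 | 355 | 253 | 2897 |
| Total | 1652 | 805 | 405 | 292 | 3154 |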
Aside: In Chapter 6 it was noted that the values of the k-level variable in a 2 × k table can have at least three forms: numeric, ordered, and nominal. Each variable type suggests a specific analytic approach: regression, ridit, and chi-square, respectively. In using a logistic model to analyze categorical data, similar issues arise. Categories that can be characterized numerically are typically used directly in the regression analysis, thereby implying a specific relationship among the values. However, nominal categorical values have no specific ordering, and no logical numeric ordering is possible. A nominal variable is incorporated into a logistic regression equation using a binary indicator variable. For example, if two races are being considered, say white and African-American, one is coded 0 and the other 1. The logistic regression coefficients and the odds ratios are then relative to the variable coded zero. If the odds ratio is 2 and whites are coded as 0 and African-Americans as 1, then the odds are two times greater among African-Americans than among whites.
The principle of using an indicator variable can be extended. When more than two categories are present (for example, whites, African-Americans, Hispanics, and Asians), a series of indicator variables is used. For k categories, k − 1 indicator variables, each taking on the values 0 or 1, allow the analysis to include unordered categorical variables in a regression model. Such a variable is frequently called a design variable. Again, a baseline category is established by setting the k − 1 components of the design variable equal to zero. The members of other categories are identified by a single component of the design variable that takes on the value one while the remaining components are zero. For example, if African-Americans, Hispanics, and Asians are to be compared relative to the whites, then for k = 4 categories,
Whites: $x_1 = 0$, $x_2 = 0$, and $x_3 = 0$
African-Americans: $x_1 = 1$, $x_2 = 0$, and $x_3 = 0$
Hispanics: $x_1 = 0$, $x_2 = 1$, and $x_3 = 0$
Asians: $x_1 = 0$, $x_2 = 0$, and $x_3 = 1$.
Like the binary case, the resulting logistic regression coefficients measure the role of each category relative to the category with all components set equal to zero. Any number of categories can be identified with this scheme, which allows the assessment of risk by using a logistic model without requiring a numeric value or even requiring that the categories be ordered. Many software analysis systems set up design variables automatically.
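The coding scheme described here is simple to automate. The sketch below is an illustration only; the function name `design_variables` is hypothetical and does not correspond to a routine in any particular software system. It maps each of k categories to k − 1 indicator values, with the chosen baseline category coded as all zeros.

```python
def design_variables(categories, baseline):
    """Map each category to a tuple of k - 1 indicator (0/1) values.

    The baseline category has every component equal to zero; each other
    category has exactly one component equal to one.
    """
    others = [c for c in categories if c != baseline]
    coding = {baseline: tuple(0 for _ in others)}
    for position, category in enumerate(others):
        coding[category] = tuple(int(i == position) for i in range(len(others)))
    return coding

categories = ["White", "African-American", "Hispanic", "Asian"]
coding = design_variables(categories, baseline="White")

# Code a few individuals: each person becomes a row (x1, x2, x3)
sample = ["African-American", "White", "Asian", "Hispanic"]
rows = [coding[person] for person in sample]
print(rows)
```

Choosing a different baseline changes the interpretation of every coefficient, because each coefficient is measured relative to the category coded with all zeros.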
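Values illustrating a design variable to identify four race/ethnic categories are given in Table 7–11. Twelve individuals (three whites, four African-Americans, two Hispanics, and three Asians) make up the “data” set.
The logistic model describing four categories of smoking exposure and CHD risk is not restricted in any way because this saturated model can produce any pattern of log-odds values. A model that does not imply a particular pattern is said to be unconstrained. Estimates of the unconstrained model parameters are (notation is given in Table 6.2) the observed baseline log-odds and the differences from it in the observed log-odds of Table 7–10: $\hat{a} = \hat{l}_0 = \log(98/1554) = -2.764$, $\hat{b}_1 = \hat{l}_1 - \hat{l}_0 = \log(70/735) - \hat{a} = 0.412$, $\hat{b}_2 = \hat{l}_2 - \hat{l}_0 = \log(50/355) - \hat{a} = 0.804$, and $\hat{b}_3 = \hat{l}_3 - \hat{l}_0 = \log(39/253) - \hat{a} = 0.894$.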
Table 7–11. Illustration of the use of a design variable to establish membership in four race/ethnic categories

| Individual | Race/ethnicity | $x_1$ | $x_2$ | $x_3$ |
|---|---|---|---|---|
| 1 | African-American | 1 | 0 | 0 |
| 2 | White | 0 | 0 | 0 |
| 3 | Asian | 0 | 0 | 1 |
| 4 | White | 0 | 0 | 0 |
| 5 | African-American | 1 | 0 | 0 |
| 6 | African-American | 1 | 0 | 0 |
| 7 | Hispanic | 0 | 1 | 0 |
| 8 | Asian | 0 | 0 | 1 |
| 9 | White | 0 | 0 | 0 |
| 10 | Asian | 0 | 0 | 1 |
| 11 | African-American | 1 | 0 | 0 |
| 12 | Hispanic | 0 | 1 | 0 |
The odds ratios measure the multiplicative risk for each smoking category relative to nonsmokers (baseline = nonsmokers). For example, smoking more than 30 cigarettes a day produces an odds 2.444 ($e^{0.894}$) times greater than the odds among nonsmokers. Like all saturated models, the same odds ratio can be directly calculated from the data: $\widehat{or} = (39/253)/(98/1554) = 2.444$. The odds ratios associated with increasing levels of smoking show an increasing risk of CHD (increasing from 1.0 to 1.5 to 2.2 to 2.4; Table 7–12 and Fig. 7–3). This increasing pattern is a property of the data and not the model. Any pattern, as mentioned, can emerge from an unconstrained model.
One reason to construct a 2 × k table is to explore the question of whether a risk factor has a specific pattern of influence on the probability of disease. For example, does the likelihood of a coronary event increase in a specific pattern as the amount smoked increases? One such pattern is a straight line on the log-odds scale, log-odds = $a + bF_i$, where $F_i$ = 0, 1, 2, and 3 identifies the four smoking categories. This linear model is constrained because two parameters are used to describe four log-odds values.
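The unconstrained estimates are again nothing more than observed log-odds and their differences from the baseline. The sketch below is a minimal illustration with illustrative variable names; it computes the baseline log-odds, the three coefficients, and the corresponding odds ratios from the counts of Table 7–10.

```python
import math

# Table 7-10: CHD events and non-events by cigarettes smoked per day
labels = ["0", "1-20", "21-30", "> 30"]
chd = [98, 70, 50, 39]
no_chd = [1554, 735, 355, 253]

log_odds = [math.log(d / h) for d, h in zip(chd, no_chd)]

a_hat = log_odds[0]                              # baseline: nonsmokers
b_hat = [l - a_hat for l in log_odds[1:]]        # b_1, b_2, b_3
odds_ratios = [math.exp(b) for b in b_hat]       # relative to nonsmokers

for label, b, ratio in zip(labels[1:], b_hat, odds_ratios):
    print(label, round(b, 3), round(ratio, 3))
```

Each odds ratio equals the cross-product ratio of the corresponding smoking column against the nonsmoking column, for example (39 × 1554)/(253 × 98) for the heaviest smokers.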
Table 7–12. CHD logistic model: smoking exposure (unconstrained and saturated model)

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −2.764 | 0.104 | | |
| Smoking (1–20) | $\hat{b}_1$ | 0.412 | 0.163 | < 0.001 | 1.510 |
| Smoking (21–30) | $\hat{b}_2$ | 0.804 | 0.183 | < 0.001 | 2.233 |
| Smoking (> 30) | $\hat{b}_3$ | 0.894 | 0.201 | < 0.001 | 2.444 |

−2 log-likelihood = 1751.695; number of model parameters = 4.
Computer-generated maximum likelihood estimates for the parameters of the linear model from the smoking data are $\hat{a} = -2.725$ and $\hat{b} = 0.325$ (Fig. 7–3), giving the estimated model log-odds $= -2.725 + 0.325F_i$. The log-odds increases by an amount 0.325 between successive smoking categories. The log-likelihood statistic that best reflects the relative fit of the two-parameter constrained model is $L_0 = 1753.045$ (number of model parameters = 2). The previous unconstrained saturated model yields a log-likelihood statistic of $L_1 = 1751.695$ (number of model parameters = 4). The difference in log-likelihood statistics has an approximate chi-square distribution with two degrees of freedom when the models differ strictly because of sampling variation ($X^2 = L_0 - L_1 = 1.350$, p-value $= P(X^2 \ge 1.350 \mid \text{no systematic difference}) = 0.509$). The comparison indicates that using a single straight line to represent the smoking/CHD relationship differs little from the saturated, unconstrained model (again Fig. 7–3). The straight-line model adequately and simply describes the relationship between smoking and CHD risk using two instead of four parameters. Consequently, the effect of the smoking categories can be effectively represented as increasing the risk of a CHD event in a parsimoniously multiplicative pattern. In symbols, log-odds $= a + bF_i$ implies $\log(or_{0i}) = bF_i$ and $or_{0i} = (e^{b})^{F_i}$, where $or_{0i}$ represents the odds ratio associated with the ith level of the risk factor relative to a baseline. For the smoking data, then, $\hat{or}_{0i} = (e^{0.325})^{F_i} = (1.384)^{F_i}$, where the baseline is the nonsmoking group ($F_i = 0$) and the estimated odds ratios increase multiplicatively (Table 7–13).
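The comparison of the straight-line and saturated models can be sketched without a statistical package. The code below is an illustration under stated assumptions, not the software used for the text: it fits log-odds $= a + bF_i$ to the grouped smoking counts by Newton-Raphson iteration, computes −2 log-likelihood for both models, and refers the difference to a chi-square distribution. The function names (`fit_line`, `minus2_loglik`, `chi2_sf`) are hypothetical helpers.

```python
import math

# Grouped smoking data: level F_i, CHD events, and number at risk.
F = [0, 1, 2, 3]
chd = [98, 70, 50, 39]
at_risk = [1652, 805, 405, 292]


def fit_line(F, y, n, iterations=25):
    """Newton-Raphson maximum likelihood estimates for log-odds = a + b*F."""
    a = math.log(sum(y) / (sum(n) - sum(y)))  # start from the overall log-odds
    b = 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(a + b * f))) for f in F]
        # Score vector: first derivatives of the log-likelihood.
        g0 = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        g1 = sum(f * (yi - ni * pi) for f, yi, ni, pi in zip(F, y, n, p))
        # Information matrix: negative second derivatives.
        w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]
        i00 = sum(w)
        i01 = sum(f * wi for f, wi in zip(F, w))
        i11 = sum(f * f * wi for f, wi in zip(F, w))
        det = i00 * i11 - i01 * i01
        a += (i11 * g0 - i01 * g1) / det
        b += (i00 * g1 - i01 * g0) / det
    return a, b


def minus2_loglik(y, n, p):
    """-2 log-likelihood for y events among n individuals with probabilities p."""
    return -2 * sum(yi * math.log(pi) + (ni - yi) * math.log(1 - pi)
                    for yi, ni, pi in zip(y, n, p))


def chi2_sf(x, df):
    """Upper-tail chi-square probability P(X^2 >= x) for integer df >= 1."""
    half = x / 2
    if df % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    while k < df:  # step the degrees of freedom up by two at a time
        q += half ** (k / 2) * math.exp(-half) / math.gamma(k / 2 + 1)
        k += 2
    return q


a_hat, b_hat = fit_line(F, chd, at_risk)
p_line = [1 / (1 + math.exp(-(a_hat + b_hat * f))) for f in F]
p_saturated = [y / n for y, n in zip(chd, at_risk)]

L0 = minus2_loglik(chd, at_risk, p_line)       # constrained (two parameters)
L1 = minus2_loglik(chd, at_risk, p_saturated)  # saturated (four parameters)
x2 = L0 - L1
p_value = chi2_sf(x2, df=2)
print(round(a_hat, 3), round(b_hat, 3), round(x2, 3), round(p_value, 3))
```

The degrees of freedom equal the difference in the number of parameters between the two models (4 − 2 = 2).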
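The linear constrained model translates into estimated logistic probabilities since

$$\hat{P}_i = \frac{1}{1 + e^{-(\hat{a} + \hat{b}F_i)}} = \frac{1}{1 + e^{-(-2.725 + 0.325F_i)}}.$$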
Table 7–13. CHD risk: smoking exposure (straight-line model)

| Levels | F = 0 | F = 1 | F = 2 | F = 3 |
|---|---|---|---|---|
| Odds ratio ($\hat{or}_{0i}$) | $(1.384)^0$ | $(1.384)^1$ | $(1.384)^2$ | $(1.384)^3$ |
| Odds ratio ($\hat{or}_{0i}$) | 1.000 | 1.384 | 1.916 | 2.651 |
| Probability ($\hat{P}_i$) | 0.062 | 0.083 | 0.112 | 0.147 |
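The entries of Table 7–13 follow from the two estimates $\hat{a} = -2.725$ and $\hat{b} = 0.325$. A short sketch (function names are illustrative; the rounded estimates are used, so the last decimal can differ slightly from the table):

```python
import math

a_hat, b_hat = -2.725, 0.325  # straight-line model estimates (rounded)


def odds_ratio(F):
    """Odds ratio for smoking level F relative to nonsmokers (F = 0)."""
    return math.exp(b_hat * F)


def probability(F):
    """Estimated probability of a CHD event at smoking level F."""
    return 1 / (1 + math.exp(-(a_hat + b_hat * F)))


for F in range(4):
    print(F, round(odds_ratio(F), 3), round(probability(F), 3))
```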
THE 2 × 2 × k TABLE
When a logistic model is used to explore the relationships within the sampled data, there are two typical ways to begin: with the simplest model or with the most complex model. In the first case, variables are added to the model until a satisfactory description is achieved (the “forward” method). In the second case, variables are removed from the most complex model until a simpler but satisfactory statistical structure is found (the “backward” method). In this section the most complicated (most parameters) model serves as a starting point for analyzing the simultaneous influences of k levels of one categorical risk factor and two levels of another, a 2 × 2 × k table. The most complicated model (i.e., the saturated model) is

$$\text{log-odds} = a + b_1x_1 + b_2x_2 + b_3x_3 + cC + d_1Cx_1 + d_2Cx_2 + d_3Cx_3,$$

where the $x_i$ values are, as before, components of a design variable indicating the categories of the four-level risk factor. The variable C = 1 identifies that another risk factor is present, and C = 0 identifies that it is absent. The three terms $d_iCx_i$ incorporate into the model an interaction between the k-level and the two-level risk factor variables. In this context, an interaction is the failure of the four-level categorical variable to have the same relationship with the outcome variable at both levels of the variable C. All direct influences and all possible interactions are included in this statistical model, yielding an eight-parameter and, therefore, a saturated model. To be concrete, suppose $x_1$, $x_2$, and $x_3$ again indicate the four smoking categories and C indicates the two categories of behavior type (type A coded C = 1 and type B coded C = 0). The WCGS data for the four smoking and the two behavior categories form a 2 × 2 × 4 table (Table 7–14). A logistic model applied to these data produces eight estimated parameters (Table 7–15).
Summary values calculated from a saturated model, as before, are the same as those calculated directly from the data themselves. The contingency table produces eight log-odds values, and the model employs eight parameters. Although a saturated model is not a summary of the sample of data, it provides a baseline against which to compare the utility of reduced models (models with fewer parameters) that are simpler representations of the relationships.
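Table 7–14. CHD logistic model: smoking by behavior types (A/B) data

| Cigs/day | 0 | 1–20 | 21–30 | > 30 | Total |
|---|---|---|---|---|---|
| Type-A behavior | | | | | |
| CHD | 70 | 45 | 34 | 29 | 178 |
| No CHD | 714 | 353 | 185 | 159 | 1411 |
| Total | 784 | 398 | 219 | 188 | 1589 |
| Type-B behavior | | | | | |
| CHD | 28 | 25 | 16 | 10 | 79 |
| No CHD | 840 | 382 | 170 | 94 | 1486 |
| Total | 868 | 407 | 186 | 104 | 1565 |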
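Since the eight-parameter model is saturated, its parameters are simple differences among the eight observed log-odds values in Table 7–14. The sketch below (variable names are illustrative) recovers them directly from the counts, with type B nonsmokers as the baseline.

```python
import math

# Table 7-14 counts by smoking category (0, 1-20, 21-30, >30 cigarettes/day).
chd = {"A": [70, 45, 34, 29], "B": [28, 25, 16, 10]}
no_chd = {"A": [714, 353, 185, 159], "B": [840, 382, 170, 94]}

log_odds = {t: [math.log(d / h) for d, h in zip(chd[t], no_chd[t])] for t in "AB"}

a = log_odds["B"][0]                                  # type B nonsmokers (baseline)
b = [log_odds["B"][i] - a for i in (1, 2, 3)]         # smoking, within type B
c = log_odds["A"][0] - a                              # type A versus B, nonsmokers
d = [log_odds["A"][i] - a - b[i - 1] - c for i in (1, 2, 3)]  # interaction terms

print(round(a, 3), [round(v, 3) for v in b], round(c, 3), [round(v, 3) for v in d])
```

The results can be checked against the estimates in Table 7–15.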
When the two risk factors do not interact, the log-odds model becomes

$$\text{log-odds} = a + b_1x_1 + b_2x_2 + b_3x_3 + cC.$$

Geometrically, the additive model is represented by two parallel lines (Fig. 7–4). The pattern is not constrained to either increase or decrease with increasing levels of smoking, but CHD risk associated with type-A and type-B individuals differs by a constant amount (measured by the estimated coefficient $\hat{c} = 0.815$ from the model) on a log-odds scale for all four levels of smoking exposure.
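Table 7–15. CHD logistic model: smoking and behavior type (saturated)

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −3.401 | 0.192 | | |
| Smoking (1–20) | $\hat{b}_1$ | 0.675 | 0.282 | 0.017 | 1.963 |
| Smoking (21–30) | $\hat{b}_2$ | 1.038 | 0.324 | 0.001 | 2.824 |
| Smoking (> 30) | $\hat{b}_3$ | 1.160 | 0.384 | 0.003 | 3.191 |
| A/B | $\hat{c}$ | 1.079 | 0.229 | < 0.001 | 2.941 |
| Interaction | $\hat{d}_1$ | −0.412 | 0.347 | 0.235 | 0.662 |
| Interaction | $\hat{d}_2$ | −0.410 | 0.395 | 0.299 | 0.664 |
| Interaction | $\hat{d}_3$ | −0.540 | 0.452 | 0.232 | 0.583 |

−2 log-likelihood = 1713.694; number of model parameters = 8.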
Table 7–16. CHD logistic model: smoking by behavior type (no interaction)

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −3.223 | 0.139 | | |
| Smoking (1–20) | $\hat{b}_1$ | 0.401 | 0.164 | 0.014 | 1.493 |
| Smoking (21–30) | $\hat{b}_2$ | 0.761 | 0.185 | < 0.001 | 2.141 |
| Smoking (> 30) | $\hat{b}_3$ | 0.775 | 0.203 | < 0.001 | 2.170 |
| A/B | $\hat{c}$ | 0.815 | 0.141 | < 0.001 | 2.259 |

−2 log-likelihood = 1716.069; number of model parameters = 5.
The increase in the log-likelihood value incurred by deleting the interaction terms from the model is evaluated by comparing the saturated model and the no-interaction model log-likelihood statistics ($X^2 = L_0 - L_1 = 1716.069 - 1713.694 = 2.375$ with three degrees of freedom, producing the p-value $= P(X^2 \ge 2.375 \mid d_1 = d_2 = d_3 = 0) = 0.498$). The comparison shows no persuasive evidence that smoking exposure and behavior type interact, thereby implying that the two risk factors are usefully represented as additive influences on CHD risk measured on a log-odds scale (multiplicatively on an odds scale). For example, the type-A (C = 1) individuals who smoke more than 30 cigarettes per day ($x_1 = 0$, $x_2 = 0$, $x_3 = 1$) have an estimated odds ratio of $\hat{or} = e^{0.775 + 0.815} = (2.170)(2.259) = 4.90$ relative to type-B individuals (C = 0) who are nonsmokers ($x_1 = x_2 = x_3 = 0$). Figure 7–4 contrasts the no-interaction model with the data (saturated model).
A further step in building a description of the relationships of the three variables in a 2 × 2 × k contingency table is to postulate a model that is linearly constrained. When the influence from the k-level factor is linearly related to the risk of disease, the log-odds model becomes

$$\text{log-odds} = a + bF + cC + d(F \times C).$$

Geometrically, this model requires the log-odds values from a 2 × 2 × k contingency table to be random deviations from two straight lines with different slopes and intercepts, one line within each level of the dichotomous variable C. In terms of the example, a straight line for type-A individuals and a straight line for type-B individuals reflects the log-odds measure of CHD risk associated with the four levels of smoking exposure. Specifically, this statistical model expressed as two straight lines is

$$C = 1: \ \text{log-odds} = (a + c) + (b + d)F \qquad \text{and} \qquad C = 0: \ \text{log-odds} = a + bF.$$

These two straight lines do not differ in principle from the straight-line logistic model used to analyze the 2 × k contingency table (previous section) since this constrained, interaction model can be viewed as summarizing two separate
Table 7–17. CHD logistic model: smoking by behavior type (constrained with interaction)

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −3.305 | 0.164 | | |
| Smoking | $\hat{b}$ | 0.423 | 0.108 | < 0.001 | 1.527 |
| A/B | $\hat{c}$ | 1.005 | 0.198 | < 0.001 | 2.733 |
| Interaction | $\hat{d}$ | −0.190 | 0.130 | 0.144 | 0.827 |

−2 log-likelihood = 1715.859; number of model parameters = 4.
Using the estimated parameters (Table 7–17), expressions for the two estimated straight lines are

$$\text{type A: log-odds} = -2.300 + 0.233F \qquad \text{and} \qquad \text{type B: log-odds} = -3.305 + 0.423F.$$

A further refinement of this constrained model postulates that the response to the k-level risk factor is the same for both levels of the dichotomous risk factor. This no-interaction model is achieved by setting d = 0 in the previous model, yielding an additive, three-parameter expression where

$$\text{log-odds} = a + bF + cC.$$
Again, the data are described by two straight lines, one within each category of the dichotomous risk factor, but the lines have identical slopes—namely, b. Geometrically, once again, “parallel” is synonymous with “no interaction.” Using this simple additive model, estimates of the three logistic model parameters are given in Table 7–18 .
The comparison of the log-likelihood statistics from the two constrained models ($X^2 = 1717.972 - 1715.859 = 2.113$ with one degree of freedom, p-value ≈ 0.15; Tables 7–17 and 7–18) indicates that describing the WCGS data (Table 7–14) with two parallel straight lines is not misleading and provides an adequate description of CHD risk.
Table 7–18. CHD logistic model: smoking exposure by behavior type (constrained with no interaction)

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −3.171 | 0.129 | | |
| Smoking | $\hat{b}$ | 0.290 | 0.060 | < 0.001 | 1.336 |
| A/B | $\hat{c}$ | 0.808 | 0.141 | < 0.001 | 2.244 |

−2 log-likelihood = 1717.972; number of model parameters = 3.
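The expressions for the two estimated parallel lines (same slope with different intercepts) are

$$\text{type A: log-odds} = -2.363 + 0.290F \qquad \text{and} \qquad \text{type B: log-odds} = -3.171 + 0.290F.$$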
Table 7–19. CHD risk by smoking exposure and behavior type (parallel straight-line model)

| Levels | F = 0 | F = 1 | F = 2 | F = 3 |
|---|---|---|---|---|
| Type A | 2.244 | 2.998 | 4.008 | 5.356 |
| Type B ($\hat{or}_{0i}$) | 1.000 | 1.336 | 1.786 | 2.387 |
These estimated odds ratios succinctly summarize the separate roles of behavior type and smoking in the risk of a coronary event, incurring a modest lack of fit to achieve a simple and easily interpreted three-parameter model.
To a large extent, the analysis of trend is contained in the previous discussion, but it is worth focusing specifically on this aspect of a logistic regression model. Data on smoking exposure and CHD frequencies along with a series of odds ratios estimated under different conditions are given in Table 7–20 .
The directly calculated odds ratios and the odds ratios estimated from the logistic model do not substantially differ. The model
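Table 7–20. Odds ratios: smoking and CHD

| $F_i$ | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| CHD | 98 | 70 | 50 | 39 |
| No CHD | 1554 | 735 | 355 | 253 |
| Odds ratio (direct) | 1.0 | 1.510 | 2.233 | 2.444 |
| Odds ratio (model) | 1.0 | 1.384 | 1.916 | 2.651 |
| Odds ratio (adjusted) | 1.0 | 1.378 | 1.900 | 2.620 |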
Adjustment for the influence of any number of variables follows the same pattern. The effect of each variable is accounted for by including a term in an additive logistic model. More exactly, the model
THE MULTIWAY TABLE
A logistic model directly extends to summarizing relationships among three categorical risk variables and a binary disease outcome. Interest is again focused on evaluating the impact of behavior type (two levels) on the risk of a coronary event while accounting for the influences from smoking (k = four levels) and blood pressure (two levels). As before, the blood pressure variable is defined as a binary variable based on systolic blood pressure measurements (< 140 and ≥ 140). These three risk variables produce a 2 × 4 × 4 contingency table of WCGS data (Table 7–21 ). The same data, in a different format, are given at the beginning of the chapter (Table 7–1 ).
The 16-parameter saturated model estimated from the WCGS data (Table 7–21) serves as a point of comparison for exploring reduced models. The log-likelihood
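Table 7–21. CHD by smoking by A/B by blood pressure

| Cigs/day | 0 | 1–20 | 21–30 | > 30 | Total |
|---|---|---|---|---|---|
| Type A and blood pressure ≥ 140 | | | | | |
| CHD | 29 | 21 | 7 | 12 | 69 |
| No CHD | 155 | 76 | 45 | 43 | 319 |
| Total | 184 | 97 | 52 | 55 | 388 |
| Type A and blood pressure < 140 | | | | | |
| CHD | 41 | 24 | 27 | 17 | 109 |
| No CHD | 559 | 277 | 140 | 116 | 1092 |
| Total | 600 | 301 | 167 | 133 | 1201 |
| Type B and blood pressure ≥ 140 | | | | | |
| CHD | 8 | 9 | 3 | 7 | 27 |
| No CHD | 171 | 62 | 31 | 14 | 278 |
| Total | 179 | 71 | 34 | 21 | 305 |
| Type B and blood pressure < 140 | | | | | |
| CHD | 20 | 16 | 13 | 3 | 52 |
| No CHD | 669 | 320 | 139 | 80 | 1208 |
| Total | 689 | 336 | 152 | 83 | 1260 |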
A basic question to be addressed arises: To what extent can the 16-parameter saturated model be reduced (fewer parameters) and still maintain a faithful but simpler representation of the relationships among the three risk variables (type-A/B behavior, smoking, blood pressure) and the probability of a CHD event? A “minimum” model that accounts for influences from the three risk factors involves four parameters, no interactions among the risk factors, and smoking exposure constrained to have the same linear impact on CHD risk within the four levels of the other two risk variables. This four-parameter logistic model is

$$\text{log-odds} = a + bF + cC + dD,$$

where F = 0, 1, 2, or 3 indicates the level of smoking exposure, C indicates behavior type, and D indicates blood pressure.

Geometrically the model represents the relationships within the 2 × 4 × 4 table as four parallel straight lines on a log-odds scale (one for each of four blood pressure/behavior type categories), each describing the same linear increase in
Table 7–22. CHD logistic model: A/B, smoking exposure, and blood pressure (no interaction and linearly constrained model)

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Constant | $\hat{a}$ | −3.365 | 0.136 | | |
| Smoking | $\hat{b}$ | 0.286 | 0.060 | < 0.001 | 1.331 |
| A/B | $\hat{c}$ | 0.769 | 0.142 | < 0.001 | 2.157 |
| Blood pressure | $\hat{d}$ | 0.779 | 0.139 | < 0.001 | 2.179 |

−2 log-likelihood = 1688.422; number of model parameters = 4.
The utility of this four-parameter logistic model is evaluated by contrasting log-likelihood statistics, reduced ($L_0$) versus saturated ($L_1$), where $X^2 = L_0 - L_1 = 1688.422 - 1667.138 = 21.284$ with 12 degrees of freedom, producing a p-value $= P(X^2 \ge 21.284 \mid \text{model fits}) = 0.046$. The phrase “model fits” means that the differences between the observed values and the model-generated values are due to chance alone. This considerably simpler logistic model is not an extremely accurate reflection of the relationships among the risk factors and the log-odds associated with a coronary event. The increase in the log-likelihood statistic illustrates the typical trade-off between lack of fit and simplicity of the model. Simpler models fit less well. Although the simpler model is not ideal, it gives an approximate measure of the influences of the three risk factors on CHD risk as if they had independent effects, producing an extremely parsimonious description (Table 7–23).
The odds ratios in Table 7–23 are derived from combinations of the parameter estimates $\hat{b}$, $\hat{c}$, and $\hat{d}$ using the relationship $\hat{or} = e^{\hat{b}F + \hat{c}C + \hat{d}D} = (1.331)^F(2.157)^C(2.179)^D$. The odds ratios associated with the 15 different levels of the risk factors are relative to the baseline category: nonsmokers (F = 0), type B (C = 0), and blood pressure < 140 (D = 0), for which or = 1.
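The multiplicative structure behind Table 7–23 can be sketched as follows. The three constants are the rounded odds ratios from Table 7–22; the function name `odds_ratio` and the dictionary layout are illustrative choices only.

```python
# Rounded odds ratios from Table 7-22.
OR_SMOKING, OR_BEHAVIOR, OR_PRESSURE = 1.331, 2.157, 2.179


def odds_ratio(F, C, D):
    """Odds ratio relative to nonsmoking, type-B individuals with blood pressure < 140.

    F: smoking level (0-3); C: 1 = type A, 0 = type B; D: 1 = blood pressure >= 140.
    """
    return OR_SMOKING ** F * OR_BEHAVIOR ** C * OR_PRESSURE ** D


# Rows keyed by (D, C), columns F = 0, 1, 2, 3, as in Table 7-23.
table = {(D, C): [round(odds_ratio(F, C, D), 3) for F in range(4)]
         for D in (1, 0) for C in (1, 0)}

for key, row in table.items():
    print(key, row)
```

Because the model is additive on the log-odds scale, every entry is a product of at most three numbers, one per risk factor.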
GOODNESS-OF-FIT: DISCRETE CASE
For contingency table data, the goodness-of-fit of a logistic model is assessed in typical fashion by generating a series of expected values based on the model (denoted $e_k$) and comparing these values to the observed counts (denoted $o_k$) using a Pearson chi-square statistic. The estimated model parameters generate a logistic probability that produces the estimated number of individuals with and without CHD events (Table 7–24).
The model-generated frequencies for each of the 32 cells in the 2 × 4 × 4 table based on the four-parameter logistic model (Table 7–22) produce 32 comparisons of $o_k$ with $e_k$. The logistic probabilities are estimated using the relationship

$$\hat{P} = \frac{1}{1 + e^{-(\hat{a} + \hat{b}F + \hat{c}C + \hat{d}D)}}.$$
Table 7–23. CHD by A/B by systolic blood pressure: odds ratios

| Blood pressure | Behavior type | F = 0 | F = 1 | F = 2 | F = 3 |
|---|---|---|---|---|---|
| ≥ 140 | A | 4.700 | 6.256 | 8.327 | 11.083 |
| ≥ 140 | B | 2.179 | 2.900 | 3.860 | 5.138 |
| < 140 | A | 2.157 | 2.871 | 3.821 | 5.086 |
| < 140 | B | 1.000 | 1.331 | 1.772 | 2.358 |
Table 7–24. Goodness-of-fit: various summaries

| Outcome | Blood pressure | $F_i$ | $o_k$ | $e_k$ | $o_k - e_k$ | $(o_k - e_k)/\sqrt{e_k}$ | $(o_k - e_k)^2/e_k$ |
|---|---|---|---|---|---|---|---|
| Behavior type A | | | | | | | |
| CHD | ≥ 140 | 0 | 29 | 25.722 | 3.278 | 0.646 | 0.418 |
| CHD | ≥ 140 | 1 | 21 | 17.251 | 3.749 | 0.903 | 0.815 |
| CHD | ≥ 140 | 2 | 7 | 11.625 | −4.625 | −1.357 | 1.840 |
| CHD | ≥ 140 | 3 | 12 | 15.239 | −3.239 | −0.830 | 0.689 |
| No CHD | ≥ 140 | 0 | 155 | 158.278 | −3.278 | −0.261 | 0.068 |
| No CHD | ≥ 140 | 1 | 76 | 79.749 | −3.749 | −0.420 | 0.176 |
| No CHD | ≥ 140 | 2 | 45 | 40.375 | 4.625 | 0.728 | 0.530 |
| No CHD | ≥ 140 | 3 | 43 | 39.761 | 3.239 | 0.514 | 0.264 |
| CHD | < 140 | 0 | 41 | 42.027 | −1.027 | −0.158 | 0.025 |
| CHD | < 140 | 1 | 24 | 27.428 | −3.428 | −0.655 | 0.428 |
| CHD | < 140 | 2 | 27 | 19.663 | 7.337 | 1.655 | 2.738 |
| CHD | < 140 | 3 | 17 | 20.062 | −3.062 | −0.684 | 0.467 |
| No CHD | < 140 | 0 | 559 | 557.973 | 1.027 | 0.043 | 0.002 |
| No CHD | < 140 | 1 | 277 | 273.572 | 3.428 | 0.207 | 0.043 |
| No CHD | < 140 | 2 | 140 | 147.337 | −7.337 | −0.604 | 0.365 |
| No CHD | < 140 | 3 | 116 | 112.938 | 3.062 | 0.288 | 0.083 |
| Behavior type B | | | | | | | |
| CHD | ≥ 140 | 0 | 8 | 12.422 | −4.422 | −1.255 | 1.574 |
| CHD | ≥ 140 | 1 | 9 | 6.411 | 2.589 | 1.023 | 1.046 |
| CHD | ≥ 140 | 2 | 3 | 3.968 | −0.968 | −0.486 | 0.236 |
| CHD | ≥ 140 | 3 | 7 | 3.141 | 3.859 | 2.177 | 4.741 |
| No CHD | ≥ 140 | 0 | 171 | 166.578 | 4.422 | 0.343 | 0.117 |
| No CHD | ≥ 140 | 1 | 62 | 64.589 | −2.589 | −0.322 | 0.104 |
| No CHD | ≥ 140 | 2 | 31 | 30.032 | 0.968 | 0.177 | 0.031 |
| No CHD | ≥ 140 | 3 | 14 | 17.859 | −3.859 | −0.913 | 0.834 |
| CHD | < 140 | 0 | 20 | 23.018 | −3.018 | −0.629 | 0.396 |
| CHD | < 140 | 1 | 16 | 14.778 | 1.222 | 0.318 | 0.101 |
| CHD | < 140 | 2 | 13 | 8.771 | 4.229 | 1.428 | 2.039 |
| CHD | < 140 | 3 | 3 | 6.256 | −3.256 | −1.302 | 1.694 |
| No CHD | < 140 | 0 | 669 | 665.982 | 3.018 | 0.117 | 0.014 |
| No CHD | < 140 | 1 | 320 | 321.222 | −1.222 | −0.068 | 0.005 |
| No CHD | < 140 | 2 | 139 | 143.229 | −4.229 | −0.353 | 0.125 |
| No CHD | < 140 | 3 | 80 | 76.744 | 3.256 | 0.372 | 0.138 |
| Total | | | 3154 | 3154 | 0.0 | 0.0 | 22.126 |
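The construction of one block of Table 7–24 can be sketched as follows, using the rounded estimates from Table 7–22 and the type-A, blood pressure ≥ 140 stratum of Table 7–21. The function and variable names are illustrative; because rounded estimates are used, results agree with the table only to about three decimals.

```python
import math

# Rounded estimates of the four-parameter model (Table 7-22).
a, b, c, d = -3.365, 0.286, 0.769, 0.779


def probability(F, C, D):
    """Model-estimated probability of CHD for smoking level F, behavior C, pressure D."""
    return 1 / (1 + math.exp(-(a + b * F + c * C + d * D)))


# One stratum of Table 7-21: type A (C = 1) with blood pressure >= 140 (D = 1).
observed_chd = [29, 21, 7, 12]
at_risk = [184, 97, 52, 55]

expected_chd = [n * probability(F, 1, 1) for F, n in enumerate(at_risk)]
residuals = [(o - e) / math.sqrt(e) for o, e in zip(observed_chd, expected_chd)]
contributions = [r ** 2 for r in residuals]  # (o - e)^2 / e

for F in range(4):
    print(F, round(expected_chd[F], 3), round(residuals[F], 3), round(contributions[F], 3))
```

Summing the contributions over all 32 cells gives the Pearson chi-square statistic in the last line of the table.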
The four-parameter logistic model, as seen in Table 7–24 (columns 7 and 8), is a reasonable representation of the observed frequencies in most categories. That is, the contributions to the chi-square statistic are small. Only a few categories are seriously misrepresented by the additive, constrained four-parameter logistic model. For example, individuals with high blood pressure who are type B and smoke more than 30 cigarettes per day are poorly predicted by the model ($e_{20} = 3.141$, $o_{20} = 7$, and $(o_{20} - e_{20})^2/e_{20} = 4.741$).
Although the four-parameter logistic model is not an ideal representation of the relationships among the risk variables, it unambiguously addresses the questions suggested by the table presented at the beginning of this chapter (Table 7–1). Blood pressure, behavior type, and smoking are represented as additive influences on the likelihood of a CHD event, allowing the influences of these risk factors to be evaluated separately (no interaction). Statistical tests demonstrate that these odds ratios represent substantial influences (not likely due to chance alone; p-values < 0.001). Therefore, the relative impacts on the likelihood of CHD estimated by odds ratios (Tables 7–22 and 7–23) are 2.157 (behavior type), 2.179 (blood pressure), and 2.358 (smoking more than 30 cigarettes per day). The three variables show about equal influence. Individuals with high blood pressure (140 or greater), with type-A behavior, and who smoke more than 30 cigarettes per day have a combined risk (measured in terms of odds) 11 times the risk of individuals who have low blood pressure, are type B, and are nonsmokers. Specifically, the joint effect of these three risk factors produces an estimated odds ratio of (2.179)(2.157)(2.358) = 11.08. The log-odds measure of risk of CHD increases consistently (linearly) and independently as the level of smoking increases. The description of the risk factors and inferences based on the estimated additive model are unequivocally a function of the model parameters. However, as illustrated by the CHD data, the correspondence between the model and the data is frequently a difficult issue.
SUMMARIZING A SERIES OF 2 × 2 TABLES
As already noted, one way to control for the confounding influence of a variable is to stratify the data into a series of more or less homogeneous groups based on values of the confounding variable (discussed in Chapter 2 and again in Chapter 10 ). One result of this process is a series of 2 × 2 tables (one table per stratum). For example, a series of strata formed on the basis of age might each contain individuals classified by the presence or absence of a coronary event, as well as by behavior type (A or B). To summarize the information contained in a series of 2 × 2 tables, three issues arise:

1. Interaction (homogeneity): Is the association between risk factor and disease the same for all tables (all strata)?

2. Association (independence): If the association is the same, then is it substantial and not likely due to random variation?

3. Estimation: If the association is the same and it is not likely due to random variation, then what is the magnitude of the risk factor/disease association?
Model-free methods answer these questions and are briefly reviewed. In addition, the parallel answers from a multivariable logistic model analysis are presented.
Test of Homogeneity
A single value summarizing the relationships within a series of 2 × 2 tables is most useful when the relationships summarized are the same for all tables (no interaction). In terms of an odds ratio, if a series of 2 × 2 tables reflect the same degree of association between risk factor and disease outcome (random fluctuations from a common overall value), then a single summary odds ratio serves as an accurate measure of that association. Otherwise, as with all interactions, a single summary odds ratio is not easily interpreted and can be entirely misleading.
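Woolf [6] proposed a model-free approach to assessing the observed differences among a series of independent odds ratios (homogeneity). Again using the notation described earlier (Chapter 3 and Appendix C), the null hypothesis states that the odds ratio estimates calculated from each of k tables differ only because of random variation, or $H_0: or_1 = or_2 = \cdots = or_k = or$, where $or_i$ is the odds ratio associated with the ith stratum-specific 2 × 2 table. To develop a test statistic to assess this hypothesis of no interaction, the logarithms of the odds ratios are used instead of the odds ratios themselves. This transformation produces estimates with approximately normal distributions (Appendix C).
A stratum-specific (ith stratum) odds ratio is estimated by $\hat{or}_i = a_id_i/(b_ic_i)$ (notation in Table 7–25), and the estimated variance of its logarithm is $\text{variance}[\log(\hat{or}_i)] = 1/a_i + 1/b_i + 1/c_i + 1/d_i$. The summary log-odds ratio is the weighted average $\log(\hat{or}_W) = \sum w_i\log(\hat{or}_i)/\sum w_i$, where the weights $w_i = 1/\text{variance}[\log(\hat{or}_i)]$ are the reciprocals of the estimated variances.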
The intuitive rationale for these weights is that, in determining the overall summary value, reliable estimates (small variance) should have relatively large weight, and unreliable estimates (large variance) should have relatively little weight.
Table 7–25. Notation for a specific 2 × 2 table: coronary disease and behavior type counts

| Behavior type | CHD | No CHD | Total |
|---|---|---|---|
| A | $a_i$ | $b_i$ | $a_i + b_i$ |
| B | $c_i$ | $d_i$ | $c_i + d_i$ |
| Total | $a_i + c_i$ | $b_i + d_i$ | $n_i$ |
Table 7–26. WCGS data: age by behavior type by CHD outcome

| Age | Type A: CHD ($a_i$) | Type A: No CHD ($b_i$) | Type B: CHD ($c_i$) | Type B: No CHD ($d_i$) | Total ($n_i$) |
|---|---|---|---|---|---|
| < 40 | 20 | 241 | 11 | 271 | 543 |
| 40–44 | 34 | 462 | 21 | 574 | 1091 |
| 45–49 | 49 | 337 | 21 | 343 | 750 |
| 50–54 | 38 | 209 | 17 | 184 | 448 |
| > 54 | 37 | 162 | 9 | 114 | 322 |
| Total | 178 | 1411 | 79 | 1486 | 3154 |
The question of whether the individual odds ratios ($or_i$) systematically differ from the overall odds ratio (or) is assessed by the Woolf test for homogeneity using the test statistic $X_W^2 = \sum w_i[\log(\hat{or}_i) - \log(\hat{or}_W)]^2$, which has an approximate chi-square distribution with k − 1 degrees of freedom when the odds ratios are homogeneous.
The WCGS data once again illustrate. Dividing the study participants into five age groups to examine the relationship between behavior type and CHD risk allows the estimation of a summary odds ratio “removing” the influence of age (Table 7–26). These five separate 2 × 2 tables each generate an independent odds ratio to measure the association between behavior type and CHD events within each of the five strata (Table 7–27, column 2). These odds ratios are minimally affected by the age of the study participants because the individuals within each stratum are essentially the same age.
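Table 7–27. WCGS data: odds ratios and log-odds ratios by age category

| Age | $\hat{or}_i$ | $\log(\hat{or}_i)$ | Variance | $w_i$ |
|---|---|---|---|---|
| < 40 | 2.045 | 0.715 | 0.149 | 6.723 |
| 40–44 | 2.012 | 0.699 | 0.081 | 12.355 |
| 45–49 | 2.375 | 0.865 | 0.074 | 13.530 |
| 50–54 | 1.968 | 0.677 | 0.095 | 10.487 |
| > 54 | 2.893 | 1.062 | 0.153 | 6.532 |
| Total | 2.373 | 0.864 | | 49.627 |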
Using the example data, $\log(\hat{or}_W) = 0.790$ and $\hat{or}_W \approx 2.20$. The test of homogeneity ($X_W^2 = 0.834$ with four degrees of freedom yields a p-value = 0.934) gives no reason to believe that the odds ratios systematically differ among the five age categories. That is, no evidence exists of an interaction.
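The Woolf calculations can be sketched directly from the counts in Table 7–26. The code is illustrative only: the variable names are assumptions, the weights are taken as reciprocals of $1/a_i + 1/b_i + 1/c_i + 1/d_i$ (which reproduces the $w_i$ column of Table 7–27), and the closed-form p-value is valid only for four degrees of freedom.

```python
import math

# Table 7-26: (a_i, b_i, c_i, d_i) = (A/CHD, A/no CHD, B/CHD, B/no CHD) per age stratum.
tables = [(20, 241, 11, 271), (34, 462, 21, 574), (49, 337, 21, 343),
          (38, 209, 17, 184), (37, 162, 9, 114)]

log_or = [math.log(a * d / (b * c)) for a, b, c, d in tables]
weights = [1 / (1 / a + 1 / b + 1 / c + 1 / d) for a, b, c, d in tables]

# Weighted average of the log-odds ratios and the corresponding summary odds ratio.
log_or_w = sum(w * l for w, l in zip(weights, log_or)) / sum(weights)
or_w = math.exp(log_or_w)

# Woolf homogeneity statistic: weighted squared deviations from the summary value.
x2_w = sum(w * (l - log_or_w) ** 2 for w, l in zip(weights, log_or))

# Upper-tail chi-square probability; this closed form holds for 4 degrees of freedom.
half = x2_w / 2
p_value = math.exp(-half) * (1 + half)

print(round(sum(weights), 3), round(log_or_w, 3), round(x2_w, 3), round(p_value, 3))
```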
Test of Association
A second step in summarizing a series of k separate 2 × 2 tables is to assess the association between the risk factor and disease, using the data from all k tables. One such test is called the Mantel-Haenszel chi-square test [7]. William Cochran suggested essentially the same test in an earlier paper [8]. The approach generates cell frequencies estimated within each 2 × 2 table as if the risk factor and disease are independent (i.e., the null hypothesis). Again (see Chapter 6), the estimated cell frequency is $\hat{a}_i = (a_i + b_i)(a_i + c_i)/n_i$.
The Mantel-Haenszel chi-square statistic, $X_{MH}^2 = [\sum a_i - \sum \hat{a}_i]^2/\sum \text{variance}(a_i)$, combines information from each table, resulting in a summary test statistic that reflects the association between risk factor and disease outcome as long as the odds ratios are homogeneous (i.e., have no interaction). The value $X_{MH}^2$ has an approximate chi-square distribution with one degree of freedom when risk factor and disease are unrelated in all k strata. Continuing the WCGS example, since $\sum a_i = 178$ (observed) and $\sum \hat{a}_i = 134.684$ (expected), then $X_{MH}^2 = 32.663$ (p-value $= P(X^2 \ge 32.663 \mid \text{independence}) < 0.001$), producing strong evidence that behavior type and CHD remain associated after accounting for influence from age.
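A sketch of the Mantel-Haenszel chi-square calculation for the age-stratified WCGS data follows. The variance expression used here, $(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)/[n_i^2(n_i - 1)]$, is the usual hypergeometric form and is an assumption in the sense that it is not shown in this section; it does reproduce the reported value of the test statistic. Variable names are illustrative.

```python
import math

# Table 7-26: (a_i, b_i, c_i, d_i) per age stratum.
tables = [(20, 241, 11, 271), (34, 462, 21, 574), (49, 337, 21, 343),
          (38, 209, 17, 184), (37, 162, 9, 114)]

observed = sum(a for a, b, c, d in tables)
expected = 0.0
variance = 0.0
for a, b, c, d in tables:
    n = a + b + c + d
    expected += (a + b) * (a + c) / n  # a_i expected under independence
    variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))

x2_mh = (observed - expected) ** 2 / variance

# Upper-tail chi-square probability for one degree of freedom.
p_value = math.erfc(math.sqrt(x2_mh / 2))

print(observed, round(expected, 3), round(x2_mh, 3), p_value)
```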
Estimation of a Common Odds Ratio
The third step in summarizing a series of 2 × 2 tables is the estimation of a common measure of association. A popular estimate that provides an overall measure of association is the Mantel-Haenszel summary odds ratio [7] given by

$$\hat{or}_{MH} = \frac{\sum a_id_i/n_i}{\sum b_ic_i/n_i}.$$
Aside: The Mantel-Haenszel estimate combines a series of ratios to produce a single summary ratio. A summary ratio is typically constructed from a weighted average of each of a series of k ratios $r_i = y_i/x_i$, where

$$R = \frac{\sum w_ir_i}{\sum w_i}.$$

The choice of the weights $w_i$ determines the properties of the resulting summary ratio. The simplest choice of weights is $w_i = 1.0$, giving

$$R = \frac{1}{k}\sum \frac{y_i}{x_i},$$

where k ratios are directly averaged to form a single estimate. A more sophisticated estimate is based on the weights $w_i = x_i$. That is, the “worth” of each ratio is proportional to the denominator $x_i$, giving

$$R = \frac{\sum y_i}{\sum x_i}.$$

This form is the most common way ratios are combined from k sources of data. If the weights are chosen so that $w_i = x_i^2$, then

$$R = \frac{\sum x_iy_i}{\sum x_i^2},$$

which is an estimate of the slope of a straight line (least squares estimate) through the origin. The weights that yield the Mantel-Haenszel estimate of the summary odds ratio are $w_i = b_ic_i/n_i$ applied to each of the k individual odds ratios $\hat{or}_i = a_id_i/(b_ic_i)$, and

$$\hat{or}_{MH} = \frac{\sum w_i\hat{or}_i}{\sum w_i} = \frac{\sum a_id_i/n_i}{\sum b_ic_i/n_i}.$$
This choice of weights produces an efficient and effective summary estimate of the common odds ratio. Other choices are possible and have other properties.
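The weighted-average view of the Mantel-Haenszel estimate can be checked numerically with the age-stratified counts of Table 7–26. This sketch uses illustrative names and computes the estimate two ways, as a weighted average of the stratum odds ratios and from the closed-form ratio of sums.

```python
# Table 7-26: (a_i, b_i, c_i, d_i) per age stratum.
tables = [(20, 241, 11, 271), (34, 462, 21, 574), (49, 337, 21, 343),
          (38, 209, 17, 184), (37, 162, 9, 114)]

# Stratum-specific odds ratios and Mantel-Haenszel weights w_i = b_i * c_i / n_i.
ratios = [a * d / (b * c) for a, b, c, d in tables]
weights = [b * c / (a + b + c + d) for a, b, c, d in tables]

# Weighted average of the k odds ratios.
or_mh = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)

# Equivalent closed form: sum(a_i d_i / n_i) / sum(b_i c_i / n_i).
or_mh_direct = (sum(a * d / (a + b + c + d) for a, b, c, d in tables)
                / sum(b * c / (a + b + c + d) for a, b, c, d in tables))

print(round(or_mh, 3), round(or_mh_direct, 3))
```

The two forms agree because each product $w_i\hat{or}_i$ reduces to $a_id_i/n_i$.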
The Woolf summary odds ratio is similarly an estimate of the odds ratio common to the k = 5 homogeneous 2 × 2 tables. In addition, the estimated variance of Woolf’s estimate of the logarithm of the summary odds ratio is particularly simple. It is the reciprocal of the sum of the weights $w_i$ (Table 7–27). In symbols, the estimated variance of $\log(\hat{or}_W)$ is variance $= 1/\sum w_i$. For the age-stratified data, variance $= 1/49.627 = 0.020$.
Logistic Regression Approach
The Woolf test for interaction, the Mantel-Haenszel test for association, and the Mantel-Haenszel summary odds ratio have analogous measures estimated from a logistic regression model. In most situations, the results are similar from a seemingly different approach.
Four logistic models relating CHD outcome to the risk factors age (five categories) and behavior type are necessary:

1. Saturated model: log-odds $= a + bF + c_1x_1 + c_2x_2 + c_3x_3 + c_4x_4 + d_1Fx_1 + d_2Fx_2 + d_3Fx_3 + d_4Fx_4$.
The variable F represents type-A (F = 1) and type-B (F = 0) behavior, and $x_1$, $x_2$, $x_3$, and $x_4$ are the components of a design variable that accounts for the five age categories. The log-likelihood statistic associated with this saturated 10-parameter model is $L_1$ = −2 log-likelihood = 1702.156.

2. Additive model (behavior type and age, no interaction): log-odds $= a + bF + c_1x_1 + c_2x_2 + c_3x_3 + c_4x_4$ (see Table 7–28, top).

3. Additive model (age only: b = 0, no influence from behavior type): log-odds $= a + c_1x_1 + c_2x_2 + c_3x_3 + c_4x_4$ (see Table 7–28, middle).

4. Additive model (behavior type only: $c_1 = c_2 = c_3 = c_4 = 0$, no influence from age): log-odds $= a + bF$ (see Table 7–28, bottom).
Table 7–28. Three additive models

| Variable | Term | Estimate | SE | p-value | Odds ratio |
|---|---|---|---|---|---|
| Behavior type and age (no interaction) | | | | | |
| Constant | $\hat{a}$ | −2.461 | 0.192 | | |
| A/B | $\hat{b}$ | 0.793 | 0.141 | < 0.001 | 2.210 |
| Age 40–44 | $\hat{c}_1$ | −0.112 | 0.232 | 0.629 | 0.894 |
| Age 45–49 | $\hat{c}_2$ | 0.510 | 0.225 | 0.023 | 1.665 |
| Age 50–54 | $\hat{c}_3$ | 0.793 | 0.236 | < 0.001 | 2.211 |
| Age > 54 | $\hat{c}_4$ | 0.921 | 0.246 | < 0.001 | 2.512 |
| $L_2$ = −2 log-likelihood = 1703.010; number of model parameters = 6. | | | | | |
| Age only (b = 0, no influence from behavior type) | | | | | |
| Constant | $\hat{a}$ | −2.894 | 0.192 | | |
| Age 40–44 | $\hat{c}_1$ | −0.132 | 0.231 | 0.569 | 0.877 |
| Age 45–49 | $\hat{c}_2$ | 0.531 | 0.224 | 0.018 | 1.700 |
| Age 50–54 | $\hat{c}_3$ | 0.838 | 0.234 | < 0.001 | 2.311 |
| Age > 54 | $\hat{c}_4$ | 1.013 | 0.246 | < 0.001 | 2.753 |
| $L_3$ = −2 log-likelihood = 1736.578; number of model parameters = 5. | | | | | |
| Behavior type only ($c_1 = c_2 = c_3 = c_4 = 0$, no influence from age) | | | | | |
| Constant | $\hat{a}$ | −2.934 | 0.115 | | |
| A/B | $\hat{b}$ | 0.864 | 0.140 | < 0.001 | 2.373 |
| $L_4$ = −2 log-likelihood = 1740.344; number of model parameters = 2. | | | | | |
Table 7–29. Comparison of four logistic models

| Model | $L_i$ | −2 log-likelihood | $\hat{b}$ | SE | $\hat{or}$ | Parameters |
|---|---|---|---|---|---|---|
| 1. A/B + age + interactions | $L_1$ | 1702.156 | 0.715 | 0.386 | 2.045 | 10 |
| 2. A/B + age | $L_2$ | 1703.010 | 0.793 | 0.141 | 2.210 | 6 |
| 3. Age only | $L_3$ | 1736.578 | - | - | - | 5 |
| 4. A/B only | $L_4$ | 1740.344 | 0.864 | 0.140 | 2.373 | 2 |
Summary results from these four models applied to the WCGS data allow a multivariable logistic analysis summarizing k separate 2 × 2 tables that parallels the three steps of the model-free approach (Table 7–29).
To test for possible interaction effects (that is, do the odds ratios systematically differ among some or all of the five age categories?), two log-likelihood statistics are compared. The log-likelihood statistic calculated from the saturated model with the interaction terms included (model 1: $L_1 = 1702.156$) is compared to the log-likelihood statistic from the model with the interaction terms ignored (model 2: $L_2 = 1703.010$), producing an increase of $X^2 = L_2 - L_1 = 0.854$, which reflects the lack of homogeneity among the five odds ratios. The contrast of the log-likelihood values from these two logistic models shows no evidence of an interaction (p-value = 0.931). This result is similar to the Woolf homogeneity chi-square value ($X_W^2 = 0.834$, p-value = 0.934). These two approaches are likely to be similar in general, particularly when each stratum contains a moderate or a large number of observations.
A Mantel-Haenszel-like test of the association between behavior type and CHD can be conducted in the context of an additive logistic model. The difference between the additive model containing the risk factors behavior type and age (model 2) and the model containing only age (model 3) indicates the degree of association between behavior type and CHD risk while accounting for the influence of age. These two additive (no-interaction) models produce a difference in log-likelihood statistics analogous to the Mantel-Haenszel test statistic $X_{MH}^2$. The increase in log-likelihood values $X^2 = L_3 - L_2$ has a chi-square distribution with one degree of freedom when the risk variable is unrelated to the outcome (b = 0). Both the Mantel-Haenszel chi-square and the logistic model approaches require that no interactions exist among the k tables. For the age and behavior type data, the difference in log-likelihood values is $X^2 = L_3 - L_2 = 1736.578 - 1703.010 = 33.568$ (p-value < 0.001), which is not very different from the Mantel-Haenszel chi-square value calculated to assess the same association ($X_{MH}^2 = 32.663$). As with the tests to assess interaction effects, these two tests of association will usually be similar.
Table 7–30. Two approaches to summarizing k independent 2 × 2 tables (strata)
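               No model                               Logistic model
Homogeneity    Woolf chi-square (p-value = 0.834)     $X^2 = 0.854$
Association    Mantel-Haenszel chi-square             $X^2 = 33.568$
Odds ratio     Mantel-Haenszel estimate               $e^{0.793}$
Odds ratio     Woolf estimate                         —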
When a series of 2 × 2 tables shows no likely interaction effects, the strictly additive logistic model (model 2) produces an estimate of the strength of the association between risk factor and disease adjusted for confounding influences from the strata variable. For the WCGS data, the behavior type coefficient is $\hat{b} = 0.793$ (model 2) and the estimated adjusted odds ratio is $e^{0.793} = 2.210$, which is almost identical to the Mantel-Haenszel summary odds ratio of 2.214. The 95% confidence interval based on the estimate $\hat{b}$ is (0.517, 1.069), and the corresponding confidence interval for the odds ratio is (1.676, 2.914). Both estimates measure the influence of A/B behavior type in determining the risk of CHD, accounting for the influence of age. In general, the odds ratio estimated from the additive logistic model, the Mantel-Haenszel summary odds ratio, and the Woolf summary odds ratio will be similar. To repeat, all three estimates require homogeneity of the odds ratios (no interactions) among the k independent tables to summarize the association between behavior type and CHD. The maximum likelihood estimate from the additive logistic model is very slightly more efficient (smaller variance) than either the Mantel-Haenszel or the Woolf estimate. Table 7–30 provides a summary comparison of the model-free and logistic model approaches to summarizing an association from a series of k independent 2 × 2 tables.
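The step from the logistic coefficient to the adjusted odds ratio and its confidence interval is a simple exponentiation. The sketch below uses only the coefficient and interval bounds quoted in the text; the variable names are illustrative. Because the inputs are rounded to three decimals, the exponentiated bounds may differ from the published (1.676, 2.914) in the last digit.

```python
import math

b_hat = 0.793                 # behavior type coefficient, model 2 (from the text)
b_lower, b_upper = 0.517, 1.069  # 95% confidence bounds for the coefficient (from the text)

# The odds ratio and its bounds are the exponentials of the coefficient and its bounds.
odds_ratio = math.exp(b_hat)
or_lower = math.exp(b_lower)
or_upper = math.exp(b_upper)

print(f"adjusted odds ratio = {odds_ratio:.3f}")
print(f"95% interval = ({or_lower:.3f}, {or_upper:.3f})")
```

Exponentiating the bounds, rather than building a symmetric interval around the odds ratio itself, keeps the interval consistent with the approximately normal distribution of the coefficient on the log-odds scale.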
The confounding influence of the age variable is identified by comparing values of the estimated coefficient $\hat{b}$ from two specific additive logistic models. The value of $\hat{b}$ estimated from the logistic model accounting for age is compared to the estimate from the model excluding age (model 2 versus model 4). When age is included, $\hat{b} = 0.793$; when age is excluded, $\hat{b} = 0.864$. In terms of odds ratios, $e^{0.793} = 2.210$ and $e^{0.864} = 2.373$. In both cases, the difference measures the confounding influence of age on a measure of the relationship between behavior type and CHD.
The Mantel-Haenszel summary odds ratio calculated from the five age strata is 2.214. The same odds ratio calculated ignoring the five age strata is 2.373. The difference also measures the confounding influence (bias) of age on the odds ratio.
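The size of the confounding bias can be expressed either as a ratio of the two odds ratios or as a difference on the log-odds scale. The following is a small arithmetic sketch using the two values quoted above; the choice of these two summaries is an illustration, not a measure prescribed by the text.

```python
import math

or_adjusted = 2.214  # Mantel-Haenszel summary odds ratio over the five age strata
or_crude = 2.373     # odds ratio calculated ignoring the age strata

# Ratio of crude to adjusted odds ratio: values above 1 mean that ignoring
# age overstates the behavior type association.
ratio = or_crude / or_adjusted

# The same comparison on the log-odds scale, where the logistic coefficients live.
log_difference = math.log(or_crude) - math.log(or_adjusted)

print(f"crude / adjusted = {ratio:.3f}")
print(f"difference in log odds ratios = {log_difference:.3f}")
```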