Survival Data: Proportional Hazards Model
Abstract and Keywords
The success of a model-based approach depends on choosing a model that accurately reflects the relationships within the data. This choice requires knowledge of the statistical properties of the model and a clear understanding of the phenomenon being investigated. One of the many useful models applied to survival data is the proportional hazards model. This chapter describes this model in simple terms, illustrating its properties and providing insight into the process of analyzing survival experience data using statistical modeling techniques.
Keywords: survival data analysis, statistical models, survival curves, epidemiologic data analysis
It is rarely sufficient to demonstrate that one group of individuals has a significantly longer survival time than another. Pursuit of plausible explanations for the observed differences is an integral part of understanding survival experience. Survival time, like risk measured by a probability, is almost always affected by a number of interrelated factors. Factors such as age, severity of disease, past health status, and race/ethnicity provide additional information that likely improves the description of survival data. The investigation of the role of these explanatory variables usually requires employing a statistical model. Such a model formally relates a series of variables to an individual’s survival time. The analysis of the effect of these explanatory variables is conducted in much the same manner as the assessment of the risk variables in a logistic or Poisson regression model. Of course, a statistical model, at best, approximates the unknown underlying situation. Alternative approaches, however, are rarely possible without large amounts of data, making a statistical model a basic tool in the investigation of variables relevant to survival.
The success of a model-based approach depends on choosing a model that accurately reflects the relationships within the data. This choice requires, as always, knowledge of the statistical properties of the model and a clear understanding of the phenomenon under investigation. One of many useful models applied to survival data is the popular proportional hazards model. In the following, this sometimes complex model is described in simple terms, with the dual purpose of illustrating its properties and providing insight into the process of analyzing survival experience data using statistical modeling techniques. The multivariable analysis of survival data is a mathematically sophisticated topic, and a number of texts are completely devoted to the many aspects of the statistical analysis of failure time data (for example, [1], [2], and [3]). This chapter is a brief introduction to the application of one specific model.
SIMPLEST CASE
The simplest application of a proportional hazards model (sometimes called the “Cox” model, after the statistician D. R. Cox, who originated the analytic approach) involves the comparison of two groups made up of individuals with varying survival times, some of which may be censored. A small hypothetical data set of 12 subjects provides an introduction (Table 13–1). These data consist of two treatment groups (A and B), each with six individuals, and a total of nine complete and three incomplete survival times.
Before employing a proportional hazards model to compare the two treatments A and B, it is useful to apply the logrank test to assess differences in survival times for this small data set. When the analysis involves simply comparing two groups, the logrank approach is a special case of the proportional hazards model, as will be illustrated. The hypothetical data classified into nine 2 × 2 tables based on the nine complete survival times is the first step in the logrank procedure (Table 13–2).
The same data are fully displayed in Figure 13–1. The Mantel-Haenszel chi-square test (logrank test) for evaluating an association between two variables stratified into k strata (2 × 2 tables) is, once again (Chapters 7 and 12), X^{2} = [Σ(a_{i} − Â_{i})]^{2} / Σ variance(a_{i}), where both sums are taken over the k strata.
Table 13–1. Hypothetical data

| Group | Survival times |
|---|---|
| Treatment A | 5, 8, 12, 22^{+}, 37, 41 |
| Treatment B | 23, 40, 43, 51^{+}, 53^{+}, 62 |
Table 13–2. Hypothetical data displayed in nine 2 × 2 tables stratified by survival time: summary

| Time (t) | a_{i} | a_{i} + b_{i} | n_{i} | Â_{i} | a_{i} − Â_{i} | Variance(a_{i}) |
|---|---|---|---|---|---|---|
| 5 | 1 | 6 | 12 | 6/12 | 6/12 | 396/1584 = 0.250 |
| 8 | 1 | 5 | 11 | 5/11 | 6/11 | 300/1210 = 0.248 |
| 12 | 1 | 4 | 10 | 4/10 | 6/10 | 216/900 = 0.240 |
| 23 | 0 | 2 | 8 | 2/8 | −2/8 | 84/448 = 0.188 |
| 37 | 1 | 2 | 7 | 2/7 | 5/7 | 60/294 = 0.204 |
| 40 | 0 | 1 | 6 | 1/6 | −1/6 | 25/180 = 0.139 |
| 41 | 1 | 1 | 5 | 1/5 | 4/5 | 16/100 = 0.160 |
| 43 | 0 | 0 | 4 | 0/4 | 0 | 0/48 = 0.000 |
| 62 | 0 | 0 | 1 | 0/1 | 0 | — |
| Total | 5 | | | 2.257 | 2.743 | 1.428 |
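The stratified calculation summarized in Table 13–2 can be sketched in a few lines of Python. This is a minimal illustration, not the author's program: the function name `logrank` and the (time, event) encoding of Table 13–1 are assumptions made here, with event = 0 marking the censored ("+") times. Applied to the hypothetical data, it reproduces the totals 2.743 and 1.428 and gives a chi-square statistic near 5.3.

```python
# Table 13-1 as (time, event) pairs; event = 0 marks a censored ("+") time.
group_a = [(5, 1), (8, 1), (12, 1), (22, 0), (37, 1), (41, 1)]
group_b = [(23, 1), (40, 1), (43, 1), (51, 0), (53, 0), (62, 1)]

def logrank(a, b):
    """Return (sum of a_i - A_i, sum of variances, chi-square) for groups a and b."""
    event_times = sorted({t for t, event in a + b if event == 1})
    diff_sum, var_sum = 0.0, 0.0
    for t in event_times:
        # Numbers still at risk in each group just before time t.
        at_risk_a = sum(1 for s, _ in a if s >= t)
        at_risk_b = sum(1 for s, _ in b if s >= t)
        n = at_risk_a + at_risk_b
        deaths_a = sum(1 for s, event in a if s == t and event == 1)
        deaths = deaths_a + sum(1 for s, event in b if s == t and event == 1)
        expected_a = deaths * at_risk_a / n
        diff_sum += deaths_a - expected_a
        if n > 1:  # a stratum with a single individual has no variance
            var_sum += deaths * (n - deaths) * at_risk_a * at_risk_b / (n * n * (n - 1))
    return diff_sum, var_sum, diff_sum ** 2 / var_sum

diff_sum, var_sum, chi_square = logrank(group_a, group_b)
print(round(diff_sum, 3), round(var_sum, 3), round(chi_square, 2))
```

The variance line is the general hypergeometric expression, so the same sketch also handles tied survival times.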
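Aside: The expected value and the estimated variance of the distribution of a_{i}, in the situation where no survival times are identical (one death per table), simplify from the previous more general expressions to Â_{i} = (a_{i} + b_{i})/n_{i} and variance(a_{i}) = (a_{i} + b_{i})[n_{i} − (a_{i} + b_{i})]/n_{i}^{2}, where a_{i} + b_{i} is the number of individuals at risk in treatment group A and n_{i} is the total number of individuals at risk at the ith complete survival time.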
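In its simplest form, the proportional hazards model postulates that the hazard function associated with survival times for individuals in treatment group A is proportional to the hazard function for individuals in treatment group B; that is, λ_{B}(t) = c × λ_{A}(t), where the constant of proportionality c does not depend on the time t.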
The constant c can be estimated from the observed survival times. Continuing to use the hypothetical data, the estimated constant of proportionality for the two hazard rates associated with treatments A and B is ĉ = 0.168. The hazard rate for treatment B is then estimated to be about six times smaller than for treatment A for all times t [λ_{A}(t) = 5.962λ_{B}(t), or λ_{B}(t) = 0.168λ_{A}(t)]. The estimation of the parameter c, like the previous regression estimates, requires a computer algorithm.
In most cases, the collected survival data are too sparse to estimate reliably the hazard functions themselves. The estimate of the ratio c, however, provides a summary measure of the difference in survival experiences between two groups when the hazard functions are proportional. The entire data set (all 12 observations) is efficiently focused on the estimation of a single parameter.
A special property of the proportional hazards model is that the estimate of the proportionality constant (c) does not require the actual form of the hazard function to be specified. Thus, the comparison of the relative survival between the two groups can be efficiently summarized and statistically compared without knowledge or assumptions about the functional form of the hazard functions λ_{A}(t) or λ_{B}(t), as long as they are proportional. In addition, the estimation of the constant c makes use of both the complete and the incomplete observations in the sampled data. Not unlike the product-limit estimate of the survival probabilities, information from the censored observations is used to form an unbiased estimate of c when it is relevant and ignored when it is not.
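In terms of survival curves, the property of proportionality of hazard functions translates to S_{B}(t) = [S_{A}(t)]^{c}, so that two survival curves with proportional hazard functions do not cross.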
Standard methods can be used to evaluate the effect of sampling variation on the estimate ĉ. To test the null hypothesis that the hazard functions are the same (H_{0}: c = 1), one effective technique is to compare two log-likelihood statistics. First, a log-likelihood value is estimated under the condition that λ_{B}(t) = λ_{A}(t), and then a second value is estimated under the condition that λ_{B}(t) = cλ_{A}(t). For the example data, the two log-likelihood statistics (each −2 times the maximized log-likelihood) are L_{c=1} = 31.996 when the compared hazard functions are the same and L_{c≠1} = 27.148 when they differ systematically. The log-likelihood values are produced as part of the estimation process. The increase X^{2} = L_{c=1} − L_{c≠1} = 4.849 has an approximate chi-square distribution with one degree of freedom when the two groups differ by chance alone, producing a p-value of P(X^{2} ≥ 4.849 | c = 1) = 0.028. The degrees of freedom are the difference in the number of parameters used to describe the two compared models. This comparison of log-likelihood statistics is essentially the same process used in logistic and Poisson regression analyses.
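The estimate ĉ and the two log-likelihood statistics can be reproduced with a short sketch that uses only the Python standard library. It assumes the usual Cox partial likelihood without tied survival times and locates the maximum with a golden-section search over log(c); the names `neg2_log_lik` and `estimate_c` are illustrative and do not correspond to any particular statistical package. For the hypothetical data of Table 13–1, it yields values of approximately ĉ = 0.168, 31.996, 27.148, X^{2} = 4.849, and a p-value of 0.028.

```python
import math

# Table 13-1 as (time, event, x): event = 0 marks a censored time,
# x = 0 for treatment A and x = 1 for treatment B.
data = [(5, 1, 0), (8, 1, 0), (12, 1, 0), (22, 0, 0), (37, 1, 0), (41, 1, 0),
        (23, 1, 1), (40, 1, 1), (43, 1, 1), (51, 0, 1), (53, 0, 1), (62, 1, 1)]

def neg2_log_lik(c):
    """-2 x log partial likelihood for lambda_B(t) = c * lambda_A(t); assumes no ties."""
    log_lik = 0.0
    for t, event, x in data:
        if event:
            # Each individual still at risk at time t contributes 1 (group A) or c (group B).
            risk = sum(c if xj else 1.0 for s, _, xj in data if s >= t)
            log_lik += math.log(c if x else 1.0) - math.log(risk)
    return -2.0 * log_lik

def estimate_c(lo=-5.0, hi=5.0, tol=1e-8):
    """Golden-section search over log(c) for the minimum of neg2_log_lik."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        m1, m2 = hi - g * (hi - lo), lo + g * (hi - lo)
        if neg2_log_lik(math.exp(m1)) < neg2_log_lik(math.exp(m2)):
            hi = m2
        else:
            lo = m1
    return math.exp((lo + hi) / 2.0)

c_hat = estimate_c()
l_equal = neg2_log_lik(1.0)      # hazard functions identical (c = 1)
l_differ = neg2_log_lik(c_hat)   # hazard functions proportional (c estimated)
x2 = l_equal - l_differ
p_value = math.erfc(math.sqrt(x2 / 2.0))  # upper tail of a chi-square, 1 df
print(round(c_hat, 3), round(l_equal, 3), round(l_differ, 3), round(x2, 3), round(p_value, 3))
```

A general-purpose search is used only to keep the sketch self-contained; statistical software typically applies Newton-type iterations to the same partial likelihood.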
Alternatively, a statistical test of the estimate ĉ in terms of log(ĉ) is equally effective. The test statistic
Applied to the comparison of the two groups, the logrank test always gives results similar to the difference between log-likelihood statistics, particularly when large sample sizes are involved. For example, the logrank test statistic X^{2} = 5.286 (p-value = 0.022) is close to the log-likelihood test statistic X^{2} = 4.849 (p-value = 0.028) for the hypothetical data. The log-likelihood and logrank approaches are essentially the same in all cases because the parameter c of the proportional hazards model is estimated by a process that also stratifies the survival data on the time of failure and combines information from each stratum to estimate the overall constant of proportionality. Also similar to the logrank procedure, the proportional hazards model is nonparametric in the sense that no need exists to specify the form of the hazard functions or survival curves to assess the relative influence of two treatments and, as mentioned, produces an estimate of the parameter c that is not biased by the presence of censored data. In other words, the logrank test in this two-sample case is a special case of the proportional hazards model [4].
The data described in Chapter 12 reflecting the efficacy of two treatments for acute myelogenous leukemia (AML) [4] can be analyzed based on the conjecture that the hazard functions associated with the two treatments, maintained (M) and nonmaintained (NM), are proportional. The estimated constant of proportionality is ĉ = 0.405. Therefore, λ_{M}(t) = ĉλ_{NM}(t) = 0.405λ_{NM}(t). The risk of a relapse (reflected by proportional hazard functions) in the maintained group of leukemia patients is estimated to be considerably less than the risk experienced by the subjects receiving no special chemotherapy (a 2.5-fold difference).
The difference in log-likelihood statistics produces X^{2} = L_{c=1} − L_{c≠1} = 85.796 − 82.500 = 3.296, which has an approximate chi-square distribution with one degree of freedom and yields a p-value of P(X^{2} ≥ 3.296 | no treatment effect) = 0.069. Assuming proportional hazard functions, the comparison of log-likelihood statistics produces borderline evidence of a systematic difference between these two treatments. This result is, as expected, similar to the logrank analysis of the same data (X^{2} = 3.396 with a p-value = 0.065; Table 12–17).
To explore these data further, it is assumed that the survival experience in both nonmaintained and maintained groups is described by an exponential function (Chapter 12), providing a simple and direct comparison of the two groups. The exponential model produces estimated hazard rates of weeks (nonmaintained) and weeks (maintained), and the ratio based on these estimates is . The estimated hazards ratio based on the proportional hazards model is, again, ĉ = 0.405. The similarity implies that the hazard rates are not only proportional but also likely constant with respect to time. The resulting estimated survival curves are displayed in Figure 13–2.
THE PROPORTIONAL HAZARDS MODEL
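A general additive proportional hazards model with k explanatory variables is expressed as λ_{i}(t) = λ_{0}(t) × c_{i} = λ_{0}(t) × e^{b_{1}x_{i1} + b_{2}x_{i2} + ⋯ + b_{k}x_{ik}}, where λ_{0}(t) is a baseline hazard function and x_{ij} is the value of the jth explanatory variable for the ith individual.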
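A common and useful practice is to “center” the x-variables so that the proportional hazards model becomes λ_{i}(t) = λ_{0}(t) × e^{b_{1}(x_{i1} − x̄_{1}) + ⋯ + b_{k}(x_{ik} − x̄_{k})}, where x̄_{j} is the mean value of the jth explanatory variable.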
Two alternative forms of the additive proportional hazards model are
The basic proportional hazards model requires that the explanatory variables do not change over time. The influence of time and the influence of the explanatory variable are separate components of the model, which is another way of saying that the components of the constant of proportionality c_{i} are unrelated to time. The proportional hazards model, however, can be modified so that the explanatory variables also depend on time. A treatment could vary during the follow-up period, or characteristics of individuals such as blood pressure or cholesterol levels could change over time. The analysis of survival time data in the presence of such time-dependent explanatory variables is discussed elsewhere ([1] or [2]).
The hazard function and the survival curve are related. As mentioned, high hazard rates lead to low survival probabilities and conversely. Formally, S(t) = e^{−∫_{0}^{t} λ(u) du}.
In the special case where the survival times have an exponential distribution (i.e., S(t) = e^{−λt}), then
The proportional hazards model is called a semiparametric model because it is made up of nonparametric and parametric components. The nonparametric property stems from the fact that the hazard functions are unspecified, and it is not necessary to describe this part of the model in a parametric form. The relationship between the explanatory variables and the survival times, however, is parametric. Specifically, the role of each explanatory variable is directly reflected by a parameter (b_{j}), which is a fundamental element of the proportional hazards model.
PLOTTING SURVIVAL CURVES
When two hazard functions are proportional, the survival curves do not cross, as noted. Suppose that two groups (1 and 2) have proportional hazard functions where c = 2; then S_{2}(t) = [S_{1}(t)]^{2}. These two survival curves do not cross because S_{2}(t) < S_{1}(t) for all values of t, but it is usually not obvious from the plots of survival curves that the hazard functions are proportional (Fig. 13–3, top). The transformation log{−log[S(t)]} is useful. For proportional hazard functions, this transformation produces curves that are parallel and differ only because of the influence of the explanatory variables. Figure 13–3 (bottom left) shows the transformed functions of S_{1}(t) and S_{2}(t). When the curves representing the transformed values of S_{1}(t) and S_{2}(t) are parallel, the plot of one against the other is a single straight line (Fig. 13–3, bottom right). In general, S_{2}(t) = [S_{1}(t)]^{c} requires log(−log[S_{2}(t)]) − log(−log[S_{1}(t)]) = log(c).
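A small numerical check illustrates why the “log-log” transformation produces parallel curves. The sketch assumes, purely for illustration, an exponential survival curve for group 1 with a hypothetical hazard rate of 0.05; any survival curve would serve. The vertical distance between the two transformed curves equals log(c) at every time point.

```python
import math

c = 2.0      # constant of proportionality between the two hazard functions
rate = 0.05  # hypothetical constant hazard rate for group 1 (an assumption)

def s1(t):
    """Survival curve for group 1 (exponential, chosen only for illustration)."""
    return math.exp(-rate * t)

def s2(t):
    """Survival curve for group 2 under proportional hazards: S2 = S1 ** c."""
    return s1(t) ** c

def loglog(s):
    """The log{-log[S(t)]} transformation of a survival probability."""
    return math.log(-math.log(s))

times = [1, 5, 10, 20, 40]
gaps = [loglog(s2(t)) - loglog(s1(t)) for t in times]
for t, gap in zip(times, gaps):
    print(t, round(s1(t), 3), round(s2(t), 3), round(gap, 4))
print("log(c) =", round(math.log(c), 4))
```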
All mathematical functions that are proportional, by definition of proportional, do not cross. This fact can be applied to proportional hazard analysis as part of exploring the issues of whether the data support the use of the “Cox” model, because proportional hazards guarantees that the survival curves do not cross.
Simplifying transformations and the geometry of the survival curves are fundamental tools in the difficult task of deciding if the relationships within a data set are meaningfully represented by a proportional hazards model. The plot of a “log-log” transformation of the survival curves is a good place to start a goodness-of-fit evaluation. However, the fact that the survival curves do not cross when the hazard functions are proportional does not mean that the estimates of the survival curves do not cross. Estimates, subject to random variation, can fluctuate
FOUR APPLICATIONS OF A PROPORTIONAL HAZARDS MODEL
Application I: CD4 Counts, Serum β_{2}-Microglobulin, and AIDS
A bivariate proportional hazards model addresses two fundamental questions: Do the two explanatory variables have independent influences? And if they do, what are their relative contributions to survival time? These two issues arise in the study of HIV-positive patients and the relationship of two predictors of AIDS: CD4 counts and serum β_{2}-microglobulin levels. Past studies show that levels of both measures correlate with progressively severe illness. A two-variable proportional hazards model describing the time to AIDS (“survival”) allows the evaluation of these two measures as prognostic tools for AIDS among HIV-infected individuals.
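The proportional hazards model is λ(t) = λ_{0}(t) × e^{b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{1}x_{2}}, where x_{1} represents the CD4 count, x_{2} represents the β_{2}-microglobulin level, and the coefficient b_{3} measures their interaction.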
Data from the San Francisco Men’s Health Study [5] provide n = 348 seropositive homosexual or bisexual men who were interviewed every six months so that their time from HIV diagnosis to AIDS (in months) is known (time-to-AIDS = “survival time”). Among the 348 study participants, 219 converted to AIDS (complete observations), and 129 remained HIV-positive (censored observations) over 40 months of follow-up. The CD4 counts and β_{2}-microglobulin levels were measured when HIV-positive participants entered the study. Applying a bivariate proportional hazards model to these data provides estimates of the three parameters b_{1}, b_{2}, and b_{3} (Table 13–3).
The effect of the interaction can be assessed by postulating that the coefficient b_{3} is zero; as usual, the test statistic z = b̂_{3}/SE(b̂_{3}) is then viewed as a random observation sampled from an approximate standard normal distribution. The associated p-value is P(|Z| ≥ 0.604 | b_{3} = 0) = 0.546.
Alternatively, the interaction model can be contrasted with the additive model (b_{3} set to zero). The proportional hazards model becomes λ(t) = λ_{0}(t) × e^{b_{1}x_{1} + b_{2}x_{2}}.
The increase in log-likelihood statistics measures the effect of excluding the interaction term from the model. The chi-square test statistic is X^{2} = 1995.952 − 1995.591 = 0.361 (one degree of freedom; Tables 13–3 and 13–4), yielding a p-value of approximately 0.55.
Table 13–3. Estimated model coefficients for CD4 counts and β_{2}-microglobulin levels from a proportional hazards analysis including an interaction term

| Term | Coefficient | SE | p-value | Hazard ratio |
|---|---|---|---|---|
| CD4 | −0.00149 | 0.00125 | < 0.001 | 0.999 |
| β_{2} | 0.00348 | 0.00038 | 0.016 | 1.003 |
| CD4 × β_{2} | −0.0000023 | 0.0000038 | 0.546 | 0.999 |

−2 log-likelihood = 1995.591; number of model parameters = 3.
Table 13–4. Estimated model coefficients for CD4 counts and β_{2}-microglobulin levels from a proportional hazards analysis excluding the interaction term: an additive model

| Term | Coefficient | SE | p-value | Hazard ratio |
|---|---|---|---|---|
| CD4 | −0.00162 | 0.00033 | < 0.001 | 0.998 |
| β_{2} | 0.00404 | 0.00082 | < 0.001 | 1.004 |

−2 log-likelihood = 1995.952; number of model parameters = 2.
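The two assessments of the interaction term can be checked directly from the tabled values. The sketch below is a minimal illustration using only the Python standard library; the helper names are hypothetical. It computes the Wald statistic from the interaction coefficient and its standard error (Table 13–3) and the likelihood ratio statistic from the two −2 log-likelihood values (Tables 13–3 and 13–4). Both p-values are close to 0.55, so neither approach indicates an important interaction.

```python
import math

def two_sided_normal_p(z):
    """P(|Z| >= |z|) for a standard normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi_square_1df_p(x2):
    """P(X2 >= x2) for a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2.0))

# Wald test of the interaction coefficient b3 (Table 13-3).
b3, se_b3 = -0.0000023, 0.0000038
z = b3 / se_b3
p_wald = two_sided_normal_p(z)

# Likelihood ratio test: additive model (Table 13-4) versus interaction model (Table 13-3).
x2 = 1995.952 - 1995.591
p_lr = chi_square_1df_p(x2)

print(round(z, 3), round(p_wald, 3), round(x2, 3), round(p_lr, 3))
```

Because the tabled coefficient and standard error are rounded, the Wald statistic computed here differs slightly in the third decimal place from the value 0.604 quoted in the text.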
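Behaving as if CD4 counts and β_{2}-microglobulin levels have independent influences, the additive model produces estimates of the separate effects of these two indicators of survival time. Estimated hazard ratios are given by e^{−0.00162(x_{1} − 800) + 0.00404(x_{2} − 200)}, relative to individuals with a CD4 count of 800 and a β_{2}-microglobulin level of 200 (Table 13–5).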
A last issue concerns the confounding effects of these variables. Assessment of the confounding influence of a variable or variables using a proportional hazards model is not different in principle from the usual approach to assessing confounding. Two estimates are compared: one estimated from a model including and another estimated from a model excluding the confounding variable or variables. When the β_{2}-microglobulin variable (x_{2}) is deleted from the bivariate proportional hazards model, the estimate of the coefficient associated with the CD4 counts is , which is somewhat different from the estimate when the influence of β_{2}-microglobulin is retained in the model, where b̂_{1} = −0.00162 (about a 23% difference). Similarly, the confounding influence of CD4 counts on the β_{2}-microglobulin coefficient can be seen from the proportional hazards model excluding
Table 13–5. Estimated hazard ratios from the proportional hazards model for three levels of CD4 counts (x_{1}) and three levels of β_{2}-microglobulin (x_{2})

| CD4 (x_{1}) | β_{2}-microglobulin (x_{2}) = 200 | 300 | 400 |
|---|---|---|---|
| 800 | 1.000 | 1.498 | 2.245 |
| 500 | 1.626 | 2.433 | 3.646 |
| 300 | 2.243 | 3.361 | 5.037 |
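The entries of Table 13–5 follow from the additive model coefficients in Table 13–4. The sketch below assumes the reference individual has a CD4 count of 800 and a β_{2}-microglobulin level of 200 (the cell with hazard ratio 1.000); the function name `hazard_ratio` is illustrative. Because the published coefficients are rounded, the computed values agree with the table only to within about 0.01.

```python
import math

# Additive model coefficients from Table 13-4.
b1, b2 = -0.00162, 0.00404      # CD4 count, beta-2-microglobulin
ref_cd4, ref_b2m = 800, 200     # reference levels (the cell equal to 1.000)

def hazard_ratio(cd4, b2m):
    """Estimated hazard ratio relative to the reference individual."""
    return math.exp(b1 * (cd4 - ref_cd4) + b2 * (b2m - ref_b2m))

table = {(cd4, b2m): hazard_ratio(cd4, b2m)
         for cd4 in (800, 500, 300) for b2m in (200, 300, 400)}

for cd4 in (800, 500, 300):
    print(cd4, [round(table[(cd4, b2m)], 3) for b2m in (200, 300, 400)])
```

Under the additive model each entry is the product of a CD4 component and a β_{2}-microglobulin component, which is why every row is a constant multiple of the first.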
Application II: Auer Rods, WBC Count, and Leukemia Survival
Data [6] collected to investigate the relationship among survival of patients with acute myelogenous leukemia, white blood cell (WBC) counts, and a white cell morphologic characteristic can be explored using a proportional hazards model. The morphologic characteristic is the presence or absence of Auer rods, termed AG-positive and AG-negative. Thirty-three survival times (in weeks) and the white blood cell counts for AG-positive and AG-negative leukemia patients are given in Table 13–6.
These survival times are complete (no censoring; all patients died), producing 17 AG-positive and 16 AG-negative observations. A simple correlation coefficient shows that the WBC count is related to survival time (correlation = −0.33), and the mean WBC counts appear to differ between AG-status groups
Table 13–6. Survival times and white blood cell counts for AG-positive and AG-negative acute myelogenous leukemia patients

| AG-positive: Patient | Weeks | WBC | AG-negative: Patient | Weeks | WBC |
|---|---|---|---|---|---|
| 1 | 65 | 2,300 | 18 | 65 | 3,000 |
| 2 | 156 | 750 | 19 | 17 | 4,000 |
| 3 | 100 | 4,300 | 20 | 7 | 1,500 |
| 4 | 134 | 2,600 | 21 | 16 | 9,000 |
| 5 | 16 | 6,000 | 22 | 22 | 5,300 |
| 6 | 108 | 10,500 | 23 | 3 | 10,000 |
| 7 | 121 | 10,000 | 24 | 4 | 19,000 |
| 8 | 4 | 17,000 | 25 | 2 | 27,000 |
| 9 | 39 | 5,400 | 26 | 3 | 28,000 |
| 10 | 143 | 7,000 | 27 | 8 | 31,000 |
| 11 | 56 | 9,400 | 28 | 4 | 26,000 |
| 12 | 26 | 32,000 | 29 | 3 | 21,000 |
| 13 | 22 | 35,000 | 30 | 30 | 79,000 |
| 14 | 1 | 100,000 | 31 | 4 | 100,000 |
| 15 | 1 | 100,000 | 32 | 43 | 100,000 |
| 16 | 5 | 52,000 | 33 | 56 | 4,400 |
| 17 | 65 | 100,000 | | | |
Plotting the survival curves (AG-positive and AG-negative; Fig. 13–4, top) and the “log-log” transformed survival curves shows no evidence that the assumption of proportional hazards is unrealistic. The plot of the transformed survival functions, AG-positive against AG-negative, yields an essentially straight line (slope = 1), indicating a strong likelihood that the ratio of hazard functions is constant with respect to time (Fig. 13–4, bottom).
A proportional hazards model relating WBC count and AG-status to survival time (including the possibility of an interaction) is λ(t) = λ_{0}(t) × e^{b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{1}x_{2}}, where x_{1} represents AG-status and x_{2} represents the log-WBC count.
It is likely that the relationship between log-WBC count and survival pattern is not the same for the AG-positive and the AG-negative individuals. Specifically, the estimated coefficients b̂_{1} and b̂_{2} do not exclusively characterize the separate roles of the two explanatory variables, AG-status and log-WBC count. Wald’s test of the interaction coefficient and the comparison of log-likelihood statistics (difference = X^{2} = 154.468 − 150.681 = 3.787, degrees of freedom = 1) produce small p-values (both close to 0.05), indicating the definite possibility of a different relationship between log-WBC count and survival for each kind of Auer rod status. The additive model (Table 13–7), therefore, is not likely a useful description of the impact of the two explanatory variables. The presence of interaction, as usual, limits the amount of summarization.
Application III: Vital Capacity, Age, and Lung Cancer Survival
Preliminary observations of patients participating in a clinical trial [7] provide 131 lung cancer survival times, as well as the patients’ ages and vital capacities (explanatory variables). The data are divided into two groups based on each patient’s measured vital capacity. One group consists of 95 patients with “high” vital capacity ratios, and the other consists of 36 patients with “low” vital capacity ratios. The vital capacity groups, patient ages, and survival times (in days) are given in Tables 13–8 and 13–9.
Figure 13–5 (top) displays the product-limit estimated survival curves associated with these two vital capacity groups. The “high” vital capacity group appears to have better survival experience. Figure 13–5 (bottom) reflects a definite influence of age on the pattern of survival for these patients. The two vital capacity groups, each stratified by age (individuals less than or equal to 65 years old and individuals greater than 65 years old), produce distinctly separate survival patterns. A summary of mean survival times additionally indicates that age influences survival (Table 13–10).
Two expected issues arise when survival data are classified into subgroups. The choice of the bounds for the categories adds an arbitrary element to the analysis that influences the final interpretations. Of more importance, the number of
Table 13–7. Proportional hazards model: leukemia and WBC count

| Term | Coefficient | SE | p-value | Hazard ratio |
|---|---|---|---|---|
| Full model | | | | |
| AG-status | −5.040 | 2.131 | 0.018 | 0.007 |
| log-WBC | 0.145 | 0.177 | 0.412 | 1.156 |
| log-WBC × AG-status | 0.531 | 0.276 | 0.055 | 1.700 |
| −2 log-likelihood = 150.681; number of model parameters = 3 | | | | |
| Additive model: interaction excluded | | | | |
| AG-status | −1.069 | 0.423 | 0.013 | 0.343 |
| log-WBC | 0.368 | 0.136 | 0.007 | 1.444 |
| −2 log-likelihood = 154.468; number of model parameters = 2 | | | | |
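Table 13–8. Lung cancer survival data: “high” vital capacity

Complete

| Time | Age | Time | Age | Time | Age | Time | Age |
|---|---|---|---|---|---|---|---|
| 0 | 74 | 1 | 74 | 1 | 63 | 3 | 78 |
| 4 | 66 | 5 | 40 | 9 | 65 | 19 | 51 |
| 21 | 73 | 30 | 62 | 36 | 68 | 39 | 50 |
| 40 | 56 | 48 | 64 | 51 | 72 | 61 | 58 |
| 89 | 64 | 90 | 41 | 90 | 69 | 92 | 76 |
| 113 | 73 | 127 | 64 | 131 | 51 | 138 | 75 |
| 139 | 56 | 143 | 50 | 159 | 60 | 168 | 74 |
| 170 | 71 | 180 | 69 | 189 | 56 | 192 | 68 |
| 201 | 64 | 212 | 58 | 223 | 70 | 229 | 76 |
| 238 | 63 | 265 | 65 | 275 | 63 | 292 | 55 |
| 317 | 65 | 322 | 55 | 350 | 54 | 357 | 73 |
| 380 | 51 | | | | | | |

Censored

| Time | Age | Time | Age | Time | Age | Time | Age |
|---|---|---|---|---|---|---|---|
| 62 | 66 | 75 | 44 | 77 | 60 | 81 | 38 |
| 83 | 59 | 83 | 42 | 84 | 67 | 86 | 62 |
| 88 | 53 | 92 | 59 | 98 | 55 | 104 | 62 |
| 116 | 62 | 129 | 35 | 131 | 43 | 162 | 45 |
| 167 | 56 | 173 | 54 | 178 | 63 | 179 | 69 |
| 184 | 69 | 184 | 67 | 194 | 58 | 256 | 57 |
| 263 | 46 | 269 | 63 | 338 | 47 | 344 | 52 |
| 347 | 59 | 349 | 61 | 350 | 66 | 362 | 56 |
| 362 | 60 | 364 | 63 | 364 | 58 | 364 | 58 |
| 365 | 66 | 368 | 70 | 368 | 39 | 372 | 58 |
| 388 | 59 | 388 | 68 | 400 | 64 | 524 | 59 |
| 528 | 63 | 545 | 63 | 546 | 55 | 552 | 52 |
| 555 | 57 | 558 | 63 | | | | |

Note: the first 45 survival times are complete (died within the study period) and the following 50 are censored.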
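Table 13–9. Lung cancer survival data: “low” vital capacity

Complete

| Time | Age | Time | Age | Time | Age | Time | Age |
|---|---|---|---|---|---|---|---|
| 0 | 55 | 2 | 75 | 2 | 73 | 2 | 65 |
| 6 | 61 | 17 | 74 | 22 | 51 | 23 | 66 |
| 54 | 67 | 56 | 51 | 61 | 36 | 63 | 54 |
| 64 | 54 | 69 | 70 | 146 | 53 | 155 | 47 |
| 161 | 46 | 233 | 41 | 248 | 61 | 283 | 53 |

Censored

| Time | Age | Time | Age | Time | Age | Time | Age |
|---|---|---|---|---|---|---|---|
| 47 | 56 | 73 | 55 | 86 | 48 | 89 | 65 |
| 91 | 58 | 169 | 58 | 172 | 62 | 177 | 53 |
| 183 | 48 | 188 | 52 | 194 | 67 | 266 | 53 |
| 266 | 53 | 267 | 52 | 351 | 71 | 372 | 71 |

Note: the first 20 survival times are complete (died within the study period) and the following 16 are censored.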
The mean age of patients in the two groups differs (“high” vital capacity years and “low” vital capacity years), and age is undoubtedly related to survival, implying that adjustment (via a proportional hazards model) will help identify differences between the two vital capacity groups that are not attributable to differing influences from age. The assumption of proportional hazards allows adjustment for the influence of age and provides an improved evaluation of “high” and “low” capacity groups, while dramatically decreasing the impact of random variation incurred by stratifying the data into four age categories. As with most regression models, large gains in efficiency are achieved because the entire data set (all 131 observations) is focused on the estimation of two summary parameters.
Application of an additive proportional hazards model to these lung cancer survival data requires making vital capacity and age components of the model, thus producing separate measures of the influence of each variable on survival time variability. That is, the overall ratio of hazard functions is factored into a component measuring the influence of vital capacity and a component measuring the influence of age on survival. Such an additive proportional hazards model describing the relationships of vital capacity and age to survival is λ(t) = λ_{0}(t) × e^{b_{1}x_{1} + b_{2}x_{2}}, where x_{1} indicates vital capacity group membership and x_{2} represents age.
Table 13–10. Summary: lung cancer mean survival time

| Vital capacity | Age (years) | n_{i} | Complete | Censored | Mean | SE |
|---|---|---|---|---|---|---|
| “High” | ≤ 65 | 68 | 27 | 41 | 578.0 | 111.3 |
| “High” | > 65 | 27 | 18 | 9 | 235.1 | 55.4 |
| “Low” | ≤ 65 | 27 | 14 | 13 | 255.3 | 68.2 |
| “Low” | > 65 | 9 | 6 | 3 | 180.7 | 73.8 |
| Total | | 131 | 65 | 66 | 376.9 | 46.7 |

Note: vital capacities “high” and “low” are in quotes as a reminder that these categories are arbitrarily chosen.
Table 13–11. Three proportional hazards models: lung cancer survival data

| Term | Coefficient | SE | p-value | Hazard ratio |
|---|---|---|---|---|
| Vital capacity group and age included | | | | |
| Group | 0.637 | 0.275 | 0.020 | 1.891 |
| Age | 0.038 | 0.015 | 0.013 | 1.034 |
| −2 log-likelihood = 549.646; number of model parameters = 2 | | | | |
| Age excluded | | | | |
| Group | 0.540 | 0.274 | 0.049 | 1.716 |
| −2 log-likelihood = 556.093; number of model parameters = 1 | | | | |
| Vital capacity group excluded | | | | |
| Age | 0.034 | 0.016 | 0.027 | 1.035 |
| −2 log-likelihood = 554.605; number of model parameters = 1 | | | | |
Contrasting the additive bivariate model (group and age included) to the model with age excluded (reduced model) shows noticeable confounder bias. The coefficient associated with the group membership (b_{1}) decreases from 0.637 to 0.540 when age is excluded from the model; in terms of the relative hazard, the change is 1.891 to 1.716. Also, the statistical evaluation shows that age has a strong influence on the survival time. The increase in log-likelihood statistics (556.093 − 549.646 = 6.447) has an approximate chi-square distribution with one degree of freedom (the difference in the number of parameters needed to specify each of the two models) when age is not related to survival (b_{2} = 0). The p-value is P(X^{2} ≥ 6.447 | b_{2} = 0) = 0.011. Both the extent of the confounder bias and the expected association with survival time indicate that age plays a significant role in the survival of these patients and is an important component in a model designed to identify the influence of vital capacity on survival time.
The influence of the “high” versus “low” vital capacity is similarly assessed. The comparison of the log-likelihood statistics (bivariate versus reduced model) produces an independent statistical evaluation of the influence of the vital capacity classification (difference between log-likelihood statistics is X^{2} = 554.605 − 549.646 = 4.959). This increase has an approximate chi-square distribution with one degree of freedom when vital capacity is unrelated to survival, yielding a p-value of P(X^{2} ≥ 4.959 | b_{1} = 0) = 0.026. Like age, the classification of individuals by vital capacity is likely associated with survival time. The increase in the log-likelihood statistic, furthermore, cannot be attributed to influences of age because a measure of the independent age effect is maintained in both the bivariate and reduced analyses (x_{2} is included in both models).
Confidence intervals based on the estimated coefficients from the proportional hazards model are constructed in the usual way (b̂_{j} ± 1.960 SE(b̂_{j}) is an approximate 95% confidence interval for the underlying coefficient b_{j}). Confidence intervals for the relative hazard are then (e^{lower}, e^{upper}). The approximate 95% confidence interval based on the estimated vital capacity coefficient b̂_{1} = 0.637 is lower = 0.637 − 1.960(0.275) = 0.098 and upper = 0.637 + 1.960(0.275) = 1.176, and the corresponding confidence interval based on the estimated relative hazard of e^{0.637} = 1.891 is (e^{0.098}, e^{1.176}) or (1.103, 3.241). The analogous approximate 95% confidence interval based on the estimated relative hazard associated with the age variable is (1.009, 1.070).
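The confidence interval arithmetic can be verified with a few lines of Python. The sketch takes the coefficients and standard errors from the bivariate model in Table 13–11; the function name `hazard_ratio_ci` is an illustrative choice. It gives bounds of about 0.098 and 1.176 for the vital capacity coefficient, which correspond to the relative hazard interval (1.103, 3.241), and the interval (1.009, 1.070) for age.

```python
import math

def hazard_ratio_ci(coef, se, z=1.960):
    """Approximate 95% limits for a coefficient and for the relative hazard e**coef."""
    lower, upper = coef - z * se, coef + z * se
    return (lower, upper), (math.exp(lower), math.exp(upper))

# Estimates from the model including vital capacity group and age (Table 13-11).
group_coef_ci, group_hr_ci = hazard_ratio_ci(0.637, 0.275)
age_coef_ci, age_hr_ci = hazard_ratio_ci(0.038, 0.015)

print([round(v, 3) for v in group_coef_ci], [round(v, 3) for v in group_hr_ci])
print([round(v, 4) for v in age_coef_ci], [round(v, 3) for v in age_hr_ci])
```

The interval is constructed on the coefficient scale and then exponentiated, so it is not symmetric about the estimated relative hazard.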
To explore the lung cancer survival data further, suppose that the “high” vital capacity group is chosen as a “baseline” survival function S_{0}(t). The product-limit estimated survival curve Ŝ_{0}(t) is given in Table 13–12 and displayed in Figure 13–6.
The survival curves for the “high” and “low” vital capacity groups (ignoring age),
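Table 13–12. Lung cancer survival data: survival curve (“high” capacity) Ŝ_{0}(t)

| Obs | Days | Ŝ_{0}(t) | Obs | Days | Ŝ_{0}(t) |
|---|---|---|---|---|---|
| 1 | 1 | 0.968 | 23 | 139 | 0.720 |
| 2 | 3 | 0.958 | 24 | 143 | 0.707 |
| 3 | 4 | 0.947 | 25 | 159 | 0.694 |
| 4 | 5 | 0.937 | 26 | 168 | 0.680 |
| 5 | 9 | 0.926 | 27 | 170 | 0.666 |
| 6 | 19 | 0.916 | 28 | 180 | 0.652 |
| 7 | 21 | 0.905 | 29 | 189 | 0.637 |
| 8 | 30 | 0.895 | 30 | 192 | 0.622 |
| 9 | 36 | 0.884 | 31 | 201 | 0.606 |
| 10 | 39 | 0.874 | 32 | 212 | 0.590 |
| 11 | 40 | 0.863 | 33 | 223 | 0.575 |
| 12 | 48 | 0.853 | 34 | 229 | 0.559 |
| 13 | 51 | 0.842 | 35 | 238 | 0.544 |
| 14 | 61 | 0.832 | 36 | 265 | 0.527 |
| 15 | 89 | 0.820 | 37 | 275 | 0.510 |
| 16 | 90 | 0.808 | 38 | 292 | 0.493 |
| 17 | 90 | 0.796 | 39 | 317 | 0.476 |
| 18 | 92 | 0.784 | 40 | 322 | 0.459 |
| 19 | 113 | 0.771 | 41 | 350 | 0.439 |
| 20 | 127 | 0.759 | 42 | 357 | 0.418 |
| 21 | 131 | 0.746 | 43 | 380 | 0.380 |
| 22 | 138 | 0.733 | | | |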
The ratio of two hazard functions summarizes the survival experience of two groups or individuals and is a particularly meaningful description of two proportional hazard functions. The estimated ratio of hazard rates from the proportional hazards model is analogous to ratios of average mortality or incidence rates except that these rates are instantaneous measures and are typically adjusted for the influence of other explanatory variables.
For the proportional hazards model, the difference in survival measured by the ratio of the two hazards functions is summarized by
(p.434) The ratio of hazard functions is analogous to an odds ratio estimated from an additive logistic model. The value indicates the relative influence of the jthvariable on survival independent of other explanatory variables, when an additive model represents the relationships between explanatory variables and the hazard rate. The relative hazard, for example, associated with vital capacity group membership is e ^{0.637} = 1.891. That is, the hazard rate in the “low” vital capacity group is a little less than twice that of the “high” vital capacity group, regardless of the ages of the individuals compared. Similarly, the independent influence of age on the hazard rates for these lung cancer patients, regardless of group membership, is e ^{0.038(age2−age1)} where age_{1} and age_{2} are compared. For example, for age_{1} = 55 and age_{2} = 75, the relative hazard is e ^{0.038(20)} = 2.138. Or, the lung cancer patients age 75 are at about twice the risk (measured by comparing hazard rates) as patients age 55 within both vital capacity groups.
The additive nature of a proportional hazards model dictates that vital capacity group membership and age do not interact. Therefore, a 55-year-old member of the “high” vital capacity group compared to a 75-year-old member of the “low” vital capacity group yields an estimated hazard ratio of e^{0.637 + 0.038(20)} = 1.891(2.138) = 4.043, demonstrating a specific partitioning of the overall hazard ratio (4.043) into relative components (vital capacity group status = 1.891 and age = 2.138). Furthermore, these comparisons reflect the impact of vital capacity and age independent of the time of observation. Of course, these efficient and parsimonious summaries of the relationships between vital capacity, age, and survival time are properties of the model, and only when the model adequately represents the data do they reflect the parallel relationships among the observed variables.
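The arithmetic behind these relative hazards can be checked with a minimal sketch. It uses only the two coefficients quoted in the text (0.637 for vital capacity group and 0.038 per year of age); the function and variable names are illustrative and do not come from the original analysis.

```python
import math

# Coefficients quoted in the text: vital capacity group and age (per year).
B_GROUP = 0.637
B_AGE = 0.038

def relative_hazard(group_diff, age_diff):
    """Hazard ratio for two individuals who differ by group_diff in the
    group indicator and by age_diff years, under the additive model."""
    return math.exp(B_GROUP * group_diff + B_AGE * age_diff)

hr_group = relative_hazard(1, 0)   # "low" versus "high" group, same age
hr_age = relative_hazard(0, 20)    # age 75 versus age 55, same group
hr_both = relative_hazard(1, 20)   # both differences at once

print(round(hr_group, 3), round(hr_age, 3), round(hr_both, 3))
```

Because the model is additive on the log scale, the combined ratio is exactly the product of the two separate ratios, which is the partitioning described above.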
Application IV: Histologic Type, Treatment, and Lung Cancer Survival
A series of 137 patients with advanced lung cancer categorized by histologic type [8] provides an opportunity to apply a proportional hazards model that requires that distinct but nonnumeric categories be taken into account. The data consist of individual survival times (days) classified by one of four lung cancer histologic types (squamous cell, small cell, adenocarcinoma, and large cell) and by a new or standard treatment (x_{1} = 0 for the new treatment and x_{1} = 1 for the standard treatment). Three other explanatory variables are also recorded for each patient: a general medical status index (x_{2}), months from diagnosis to the start of the study (x_{3}), and age (x_{4}). These data are given in Tables 13–13 and 13–14.
An analysis to determine whether or not a proportional hazards model is an accurate representation of the data is not presented. Statistical tests to assess the assumption of proportionality are part of several “package” computer programs.
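Table 13–13. Lung cancer by type: new treatment (x_{1} = 0)

| Squamous cell: Time | x_{2} | x_{3} | x_{4} | Small cell: Time | x_{2} | x_{3} | x_{4} | Adenocarcinoma: Time | x_{2} | x_{3} | x_{4} | Large cell: Time | x_{2} | x_{3} | x_{4} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 999 | 90 | 12 | 54 | 25 | 30 | 2 | 69 | 24 | 40 | 2 | 60 | 52 | 60 | 4 | 45 |
| 122 | 80 | 6 | 60 | 103^{+} | 70 | 22 | 36 | 18 | 40 | 5 | 69 | 164 | 70 | 15 | 68 |
| 87^{+} | 80 | 3 | 48 | 21 | 20 | 4 | 71 | 83^{+} | 9 | 3 | 57 | 19 | 30 | 4 | 39 |
| 231^{+} | 50 | 8 | 52 | 13 | 30 | 2 | 62 | 31 | 80 | 3 | 39 | 53 | 60 | 12 | 66 |
| 242 | 50 | 1 | 70 | 87 | 60 | 2 | 60 | 51 | 60 | 5 | 62 | 15 | 30 | 5 | 63 |
| 991 | 70 | 7 | 50 | 2 | 40 | 36 | 44 | 90 | 60 | 22 | 50 | 43 | 60 | 11 | 49 |
| 111 | 70 | 3 | 62 | 20 | 30 | 9 | 54 | 52 | 60 | 3 | 43 | 340 | 80 | 10 | 64 |
| 1 | 20 | 21 | 65 | 7 | 20 | 11 | 66 | 73 | 60 | 3 | 70 | 133 | 75 | 1 | 65 |
| 587 | 60 | 3 | 58 | 24 | 60 | 8 | 49 | 8 | 50 | 5 | 66 | 111 | 60 | 5 | 64 |
| 389 | 90 | 2 | 62 | 99 | 70 | 3 | 72 | 36 | 70 | 8 | 61 | 231 | 70 | 18 | 67 |
| 33 | 30 | 6 | 64 | 8 | 80 | 2 | 68 | 48 | 10 | 4 | 81 | 378 | 80 | 4 | 65 |
| 25 | 20 | 36 | 63 | 99 | 85 | 4 | 62 | 7 | 40 | 4 | 58 | 49 | 30 | 3 | 37 |
| 357 | 70 | 13 | 58 | 61 | 70 | 2 | 71 | 140 | 70 | 3 | 63 | — | — | — | — |
| 467 | 90 | 2 | 64 | 95 | 70 | 1 | 61 | 186 | 90 | 3 | 60 | — | — | — | — |
| 201 | 80 | 28 | 52 | 80 | 50 | 17 | 71 | 84 | 80 | 4 | 62 | — | — | — | — |
| 1 | 50 | 7 | 35 | 51 | 30 | 87 | 59 | 19 | 50 | 10 | 42 | — | — | — | — |
| 30 | 70 | 11 | 63 | 29 | 40 | 8 | 67 | 45 | 40 | 3 | 69 | — | — | — | — |
| 44 | 60 | 13 | 70 | 25 | 70 | 2 | 6 | 80 | 40 | 4 | 63 | — | — | — | — |
| 283 | 90 | 2 | 51 | — | — | — | — | — | — | — | — | — | — | — | — |
| 15 | 50 | 13 | 40 | — | — | — | — | — | — | — | — | — | — | — | — |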
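Table 13–14. Lung cancer by type: standard treatment (x_{1} = 1)

| Squamous cell: Time | x_{2} | x_{3} | x_{4} | Small cell: Time | x_{2} | x_{3} | x_{4} | Adenocarcinoma: Time | x_{2} | x_{3} | x_{4} | Large cell: Time | x_{2} | x_{3} | x_{4} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72 | 60 | 7 | 69 | 30 | 60 | 3 | 61 | 8 | 20 | 19 | 61 | 177 | 50 | 16 | 66 |
| 411 | 60 | 5 | 64 | 384 | 60 | 9 | 42 | 92 | 70 | 10 | 60 | 162 | 80 | 5 | 62 |
| 228 | 60 | 3 | 38 | 4 | 40 | 2 | 35 | 35 | 40 | 6 | 62 | 216 | 50 | 15 | 52 |
| 126 | 60 | 9 | 63 | 54 | 80 | 4 | 63 | 117 | 80 | 2 | 38 | 553 | 70 | 2 | 47 |
| 118 | 70 | 11 | 65 | 13 | 60 | 4 | 56 | 132 | 80 | 5 | 50 | 278 | 60 | 12 | 63 |
| 10 | 20 | 5 | 49 | 123^{+} | 40 | 3 | 55 | 12 | 50 | 4 | 63 | 12 | 40 | 12 | 68 |
| 82 | 40 | 10 | 69 | 97^{+} | 60 | 5 | 67 | 162 | 80 | 5 | 64 | 260 | 80 | 5 | 45 |
| 110 | 80 | 29 | 68 | 153 | 60 | 14 | 63 | 3 | 30 | 3 | 43 | 200 | 80 | 12 | 41 |
| 314 | 50 | 18 | 43 | 59 | 30 | 2 | 65 | 95 | 80 | 4 | 34 | 156 | 70 | 2 | 60 |
| 100^{+} | 70 | 6 | 70 | 117 | 80 | 3 | 46 | — | — | — | — | 182^{+} | 90 | 2 | 62 |
| 42 | 60 | 4 | 81 | 16 | 30 | 4 | 53 | — | — | — | — | 143 | 90 | 8 | 60 |
| 8 | 40 | 58 | 63 | 151 | 50 | 12 | 69 | — | — | — | — | 105 | 80 | 11 | 66 |
| 144 | 30 | 4 | 63 | 22 | 60 | 4 | 68 | — | — | — | — | 103 | 80 | 5 | 38 |
| 25^{+} | 80 | 9 | 52 | 56 | 80 | 12 | 43 | — | — | — | — | 250 | 70 | 8 | 53 |
| 11 | 70 | 11 | 48 | 21 | 40 | 2 | 55 | — | — | — | — | 100 | 60 | 13 | 37 |
| — | — | — | — | 18 | 20 | 15 | 42 | — | — | — | — | — | — | — | — |
| — | — | — | — | 139 | 80 | 2 | 64 | — | — | — | — | — | — | — | — |
| — | — | — | — | 20 | 30 | 5 | 65 | — | — | — | — | — | — | — | — |
| — | — | — | — | 31 | 75 | 3 | 65 | — | — | — | — | — | — | — | — |
| — | — | — | — | 52 | 70 | 2 | 55 | — | — | — | — | — | — | — | — |
| — | — | — | — | 287 | 60 | 25 | 66 | — | — | — | — | — | — | — | — |
| — | — | — | — | 18 | 30 | 4 | 60 | — | — | — | — | — | — | — | — |
| — | — | — | — | 51 | 60 | 1 | 67 | — | — | — | — | — | — | — | — |
| — | — | — | — | 122 | 80 | 28 | 53 | — | — | — | — | — | — | — | — |
| — | — | — | — | 27 | 60 | 8 | 62 | — | — | — | — | — | — | — | — |
| — | — | — | — | 54 | 70 | 1 | 67 | — | — | — | — | — | — | — | — |
| — | — | — | — | 7 | 50 | 7 | 72 | — | — | — | — | — | — | — | — |
| — | — | — | — | 63 | 50 | 11 | 48 | — | — | — | — | — | — | — | — |
| — | — | — | — | 392 | 40 | 4 | 68 | — | — | — | — | — | — | — | — |
| — | — | — | — | 10 | 40 | 23 | 67 | — | — | — | — | — | — | — | — |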
To begin to understand the influence of the explanatory variables on survival time among the lung cancer patients, an additive proportional hazards model is proposed that includes all five explanatory variables. The proportional hazards model is fundamentally the same as the previous additive models, except that a design variable indicates the four histologic categories. That is, a three-component design variable (z_{1}, z_{2}, and z_{3}) identifies four histologic types: z_{1} = 1 if the cancer type is small cell, with z_{2} = z_{3} = 0; z_{2} = 1 if the cancer type is adenocarcinoma, with z_{1} = z_{3} = 0; z_{3} = 1 if the cancer type is large cell, with z_{1} = z_{2} = 0; and squamous cell carcinoma is established as the baseline reference group by setting z_{1} = z_{2} = z_{3} = 0. An additive proportional hazards model incorporating the five explanatory variables is
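A model of this kind can be written in the following form. This is a sketch assembled from the coefficient labels used in the text and in Table 13–15 (c_1, c_2, c_3 for the histologic design variable and b_1, ..., b_4 for treatment, medical status, months, and age); the baseline hazard symbol \lambda_0(t) is an assumption introduced here.

```latex
\lambda(t \mid z_1, z_2, z_3, x_1, x_2, x_3, x_4)
  = \lambda_0(t)\,
    e^{\,c_1 z_1 + c_2 z_2 + c_3 z_3 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4}
```

For a squamous cell patient all three z-values are zero, so each coefficient c_j measures the log relative hazard of one histologic type compared with squamous cell carcinoma.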
The comparison of log-likelihood values from the full model and the model with the months from diagnosis (x_{3}) and age (x_{4}) variables removed (reduced model) shows that these two variables add little to the description of the survival times of the lung cancer patients. Comparison of the respective log-likelihood statistics (X^{2} = 918.101 − 916.335 = 1.766 with two degrees of freedom, yielding a p-value = P(X^{2} ≥ 1.766 | b_{3} = b_{4} = 0) = 0.414) produces no statistical evidence that these two explanatory variables are useful contributors to the study of the survival of these patients. The medical status index (x_{2}), however, appears worth including in the analysis (z = −5.952 with a p-value < 0.001). The same is true for histologic type. Excluding the cancer histologic type (c_{1} = c_{2} = c_{3} = 0) substantially increases the log-likelihood statistic over the five-variable model (Table 13–15), producing a likely nonrandom difference in log-likelihood values: X^{2} = 936.722 − 918.101 = 18.621 with three degrees of freedom, yielding a p-value = P(X^{2} ≥ 18.621 | c_{1} = c_{2} = c_{3} = 0) < 0.001. The treatment variable coefficient indicates that treatment status (standard versus new) is marginally important in explaining the differences in survival times between these two groups of patients when adjusted for medical status and histologic type. The hazard rate associated with the standard treatment divided by the hazard rate associated with the new treatment is e^{0.334} = 1.400 (relative hazard). This estimate produces a p-value of P(|Z| ≥ 1.684 | b_{1} = 0) = 0.092 when the medical status index and the histologic types are maintained in the model.
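The p-values quoted in this comparison can be reproduced with a short sketch that uses only the Python standard library. The closed-form chi-square tail probabilities for one, two, and three degrees of freedom are standard results; the function names are illustrative, and the −2 log-likelihood values are those reported in Table 13–15.

```python
import math

def chi2_upper_tail(x, df):
    """P(X^2 >= x) for a chi-square variable; closed forms for df = 1, 2, 3."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("only df = 1, 2, 3 are handled in this sketch")

def two_sided_normal_p(z):
    """P(|Z| >= |z|) for a standard normal variable."""
    return math.erfc(abs(z) / math.sqrt(2))

# -2 log-likelihood values reported in Table 13-15.
full, reduced, no_histology = 916.335, 918.101, 936.722

x2_months_age = reduced - full          # removes b3 and b4 (2 df)
x2_histology = no_histology - reduced   # removes c1, c2, c3 (3 df)

p_months_age = chi2_upper_tail(x2_months_age, 2)
p_histology = chi2_upper_tail(x2_histology, 3)
p_treatment = two_sided_normal_p(1.684)

print(round(x2_months_age, 3), round(p_months_age, 3))
print(round(x2_histology, 3), p_histology < 0.001)
print(round(p_treatment, 3))
```

Each likelihood ratio statistic is a difference between nested models, and its degrees of freedom equal the number of coefficients set to zero.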
Table 13–15. Proportional hazards model: histology-specific lung cancer survival data

Full model

| Term | Symbol | Coefficient | SE | p-value | Hazard ratio |
|---|---|---|---|---|---|
| Small cell | ĉ_{1} | 0.884 | 0.268 | < 0.001 | 2.421 |
| Adenocarcinoma | ĉ_{2} | 1.170 | 0.296 | < 0.001 | 3.223 |
| Large cell | ĉ_{3} | 0.372 | 0.280 | 0.184 | 1.450 |
| Treatment | b̂_{1} | 0.385 | 0.205 | 0.061 | 1.470 |
| Status | b̂_{2} | −0.033 | 0.006 | < 0.001 | 0.968 |
| Months | b̂_{3} | 0.001 | 0.008 | 0.913 | 1.001 |
| Age | b̂_{4} | −0.012 | 0.009 | 0.188 | 0.988 |

−2 log-likelihood = 916.335; number of model parameters = 7.

Months and age excluded (b_{3} = b_{4} = 0)

| Term | Symbol | Coefficient | SE | p-value | Hazard ratio |
|---|---|---|---|---|---|
| Small cell | ĉ_{1} | 0.848 | 0.264 | < 0.001 | 2.334 |
| Adenocarcinoma | ĉ_{2} | 1.134 | 0.293 | < 0.001 | 3.109 |
| Large cell | ĉ_{3} | 0.361 | 0.279 | 0.195 | 1.435 |
| Treatment | b̂_{1} | 0.334 | 0.199 | 0.094 | 1.400 |
| Status | b̂_{2} | −0.031 | 0.005 | < 0.001 | 0.970 |

−2 log-likelihood = 918.101; number of model parameters = 5.

Months, age, and histologic type excluded (b_{3} = b_{4} = c_{1} = c_{2} = c_{3} = 0)

| Term | Symbol | Coefficient | SE | p-value | Hazard ratio |
|---|---|---|---|---|---|
| Treatment | b̂_{1} | 0.239 | 0.182 | 0.190 | 1.270 |
| Status | b̂_{2} | −0.033 | 0.005 | < 0.001 | 0.967 |

−2 log-likelihood = 936.722; number of model parameters = 2.
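The way the design variable and the fitted coefficients combine into a relative hazard can be sketched as follows. The coefficients are the reduced-model estimates from Table 13–15; the function names, the tuple layout (histologic type, x1, x2), and the example patients are hypothetical and chosen only for illustration.

```python
import math

# Reduced-model estimates from Table 13-15 (months and age excluded).
HISTOLOGY_ORDER = ("small cell", "adenocarcinoma", "large cell")  # z1, z2, z3
C = (0.848, 1.134, 0.361)   # c1, c2, c3
B_TREATMENT = 0.334         # b1, with x1 = 0 (new) or 1 (standard)
B_STATUS = -0.031           # b2, per unit of the medical status index x2

def design_variables(histology):
    """Code a histologic type as (z1, z2, z3); squamous cell is the baseline."""
    if histology != "squamous cell" and histology not in HISTOLOGY_ORDER:
        raise ValueError(f"unknown histologic type: {histology!r}")
    return tuple(int(histology == name) for name in HISTOLOGY_ORDER)

def log_hazard_term(histology, x1, x2):
    """Exponent of the model; the baseline hazard is omitted because it
    cancels whenever two patients are compared."""
    z = design_variables(histology)
    return sum(c * zi for c, zi in zip(C, z)) + B_TREATMENT * x1 + B_STATUS * x2

def hazard_ratio(patient_a, patient_b):
    """Estimated hazard of patient_a relative to patient_b."""
    return math.exp(log_hazard_term(*patient_a) - log_hazard_term(*patient_b))

# Hypothetical comparison: small cell on standard treatment versus
# squamous cell on new treatment, both with medical status index 60.
hr = hazard_ratio(("small cell", 1, 60), ("squamous cell", 0, 60))
print(design_variables("adenocarcinoma"), round(hr, 3))
```

Because the two patients share the same medical status, that term cancels and the ratio reduces to e^{0.848 + 0.334}, the product of the small cell and treatment hazard ratios.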
To describe these data further, it is assumed that the hazard rates are at least approximately constant over the range of the follow-up period. This additional assumption yields a model in terms of survival probabilities as
Comparisons of these estimated mean survival times describe the relative influences of the histologic type and treatment (new versus standard) on the estimated
Table 13–16. Lung cancer mean survival times: both treatments and four histologic types for patients with average level of medical status

| Treatment | Squamous | Small | Adeno | Large |
|---|---|---|---|---|
| New (x_{1} = 0) | 193.52 | 82.88 | 62.26 | 134.88 |
| Standard (x_{1} = 1) | 138.92 | 59.35 | 44.58 | 96.58 |
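When the hazard rate is constant, the mean survival time is the reciprocal of the hazard rate, so a ratio of mean survival times is the reciprocal of the corresponding hazard ratio. The sketch below checks this against Table 13–16, assuming the table was produced from the reduced model of Table 13–15 (treatment coefficient 0.334); the variable names are illustrative.

```python
import math

# Mean survival times (days) from Table 13-16.
mean_survival = {
    "new": {"squamous": 193.52, "small": 82.88, "adeno": 62.26, "large": 134.88},
    "standard": {"squamous": 138.92, "small": 59.35, "adeno": 44.58, "large": 96.58},
}

# Standard-versus-new hazard ratio from the reduced model (assumed source).
treatment_hazard_ratio = math.exp(0.334)

# With constant hazards, mean(new) / mean(standard) equals
# hazard(standard) / hazard(new) within every histologic type.
ratios = {
    histology: mean_survival["new"][histology] / mean_survival["standard"][histology]
    for histology in mean_survival["new"]
}

for histology, ratio in ratios.items():
    print(histology, round(ratio, 3))
print("hazard ratio", round(treatment_hazard_ratio, 3))
```

The same ratio appears in every column because the additive model contains no interaction between treatment and histologic type.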
DEPENDENCY ON FOLLOW-UP TIME
A proportional hazards model does not require the relationship between follow-up time and survival to be specified in detail. The estimated relative hazard measures the separate influence of each explanatory variable on survival, free of any confounding influence of time, as long as the hazard functions are proportional and the model is additive. It is instructive to describe a situation in which a dependency exists between follow-up time and a measure of survival. Two perspectives provide brief illustrations. A simple model illustrating the dependency of the odds ratio on follow-up time is presented, followed by a comparison of the logistic regression model (no influence of follow-up time) to the proportional hazards regression model (accounting for any influence of follow-up time) using the same data.
Odds Ratio Model
Consider two groups with constant hazard rates (exponential survival) given by λ_{1} and λ_{2}. The survival probabilities associated with these two groups at a specific time t are e^{−λ_{1}t} and e^{−λ_{2}t}.
The odds ratio measuring the relative differences in survival for these two groups at time t is
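An odds ratio of this kind can be written out as follows. This is a sketch under stated assumptions rather than a quotation: the symbols S_i(t) and or(t) are introduced here, and the odds are taken to be the odds of failure by time t (one minus the survival probability, divided by the survival probability), with group 1 in the numerator.

```latex
S_i(t) = e^{-\lambda_i t}, \qquad
or(t) = \frac{[1 - S_1(t)]/S_1(t)}{[1 - S_2(t)]/S_2(t)}
      = \frac{e^{\lambda_1 t} - 1}{e^{\lambda_2 t} - 1}
```

When \lambda_1 > \lambda_2, the numerator grows faster than the denominator, so this ratio increases with t.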
As follow-up time increases, the odds ratio increases. This increase is large (very large) for hazard rates above 0.1 (top of Fig. 13–8). The structure of the odds ratio is such that it is forced to become large and difficult to interpret as follow-up time increases. However, for small and more typical hazard rates (in the neighborhood of 0.001), follow-up time has a less dramatic influence on the odds ratio (bottom of Fig. 13–8). For hazard rates in the range normally observed in human populations, the odds ratio, nevertheless, increases over time in an approximately linear pattern with a slope proportional to the difference between the two hazard rates. Specifically,
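The behavior described here can be illustrated numerically. The sketch assumes the odds are odds of failure under exponential survival, so that the odds ratio at time t is (e^{λ1 t} − 1)/(e^{λ2 t} − 1), and it uses the first-order expansion (λ1/λ2)[1 + (λ1 − λ2)t/2] as the linear approximation; the hazard rates and follow-up times are illustrative values, not data from the text.

```python
import math

def odds_ratio(lam1, lam2, t):
    """Odds of failure by time t in group 1 relative to group 2,
    assuming constant hazard rates lam1 and lam2."""
    return math.expm1(lam1 * t) / math.expm1(lam2 * t)

def linear_approximation(lam1, lam2, t):
    """First-order approximation, adequate when lam1 * t and lam2 * t are small;
    the slope in t is proportional to lam1 - lam2."""
    return (lam1 / lam2) * (1 + (lam1 - lam2) * t / 2)

# Small hazard rates: the odds ratio drifts upward almost linearly.
for t in (10, 50, 100):
    exact = odds_ratio(0.002, 0.001, t)
    approx = linear_approximation(0.002, 0.001, t)
    print(t, round(exact, 4), round(approx, 4))

# Large hazard rates: the odds ratio grows rapidly with follow-up time.
print(round(odds_ratio(0.2, 0.1, 50), 1))
```

With the small rates the odds ratio stays close to the hazard ratio of 2, while with the large rates it is far from 2 and no longer a useful summary.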
Western Collaborative Group Study Data
The prospectively collected data from the Western Collaborative Group Study (WCGS) include the time from admission to the study to the time of a coronary event or withdrawal, making it possible to calculate the follow-up times for 3154 participants (Appendix A). Among these study participants, 257 coronary events occurred, and the remaining 92% of the sample were either lost to follow-up (16%) or withdrawn from follow-up because they were disease-free (76% censored). A proportional hazards model applied to these data shows the influence of eight risk factors on CHD “survival” time (CHD-free time). Table 13–17 gives both the estimated coefficients for the additive proportional hazards model and for the parallel additive logistic model (Table 8–13) applied to the same WCGS data.
Multiple logistic model estimates are derived from binary outcomes (CHD event or no CHD event), disregarding time of occurrence. That is, the logistic analysis does not use survival time information. For example, a coronary event that occurs early in a study is given the same weight as a later coronary event.
Table 13–17. A comparison of the proportional hazards model and the logistic model (WCGS data)

| Factor | “Cox” model: coefficient | SE | Logistic model: coefficient | SE |
|---|---|---|---|---|
| Age | 0.063 | 0.011 | 0.065 | 0.012 |
| Height | 0.015 | 0.031 | 0.016 | 0.033 |
| Weight | 0.007 | 0.004 | 0.008 | 0.004 |
| Systolic bp | 0.014 | 0.006 | 0.018 | 0.006 |
| Diastolic bp | 0.008 | 0.010 | −0.002 | 0.010 |
| Cholesterol | 0.009 | 0.001 | 0.011 | 0.002 |
| Smoking | 0.021 | 0.004 | 0.021 | 0.004 |
| A/B | 0.671 | 0.137 | 0.653 | 0.145 |
The logistic and proportional hazards models, when applied to the WCGS data, produce essentially the same results for two reasons (Table 13–17). Coronary events occurred only among a small proportion (8%) of the study subjects (92% of the subjects were censored or lost), producing relatively little information on the follow-up time and CHD events. Also, the eight risk variables, measured only once at the beginning of the study, changed very little during the years of follow-up. For example, height did not change at all, and the smoking and behavior variables were essentially constant. Therefore, the explanatory variables are not strongly associated with time to a coronary event.
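The similarity of the two sets of estimates is easy to see on the ratio scale. The sketch below exponentiates two pairs of coefficients from Table 13–17 to give a relative hazard ("Cox" model) and an odds ratio (logistic model); the choice of factors and of a 10-year age difference is illustrative only.

```python
import math

# (Cox coefficient, logistic coefficient) pairs from Table 13-17.
coefficients = {
    "A/B": (0.671, 0.653),   # behavior type indicator
    "Age": (0.063, 0.065),   # per year of age
}

# Behavior type: relative hazard versus odds ratio.
hazard_ratio_ab = math.exp(coefficients["A/B"][0])
odds_ratio_ab = math.exp(coefficients["A/B"][1])

# Age: a 10-year difference on each scale.
hazard_ratio_age10 = math.exp(10 * coefficients["Age"][0])
odds_ratio_age10 = math.exp(10 * coefficients["Age"][1])

print(round(hazard_ratio_ab, 3), round(odds_ratio_ab, 3))
print(round(hazard_ratio_age10, 3), round(odds_ratio_age10, 3))
```

With only 8% of subjects experiencing an event, the two measures nearly coincide, in line with the first reason given above.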