Survival Analysis

Shenyang Guo

Print publication date: 2009

Print ISBN-13: 9780195337518

Published to Oxford Scholarship Online: January 2010

DOI: 10.1093/acprof:oso/9780195337518.001.0001


6 Multilevel Analysis of Time-to-Event Data

Abstract and Keywords

This chapter focuses on a relatively new class of survival models: the multilevel approaches to time-to-event data. It first reviews the importance of conducting multilevel analysis in social work research and the recent advances in biostatistics to correct for autocorrelated survival times. It then describes the independence assumption embedded in the Cox proportional hazards model and details the negative consequences of including autocorrelated data. It reviews available biomedical research models that correct for autocorrelation, particularly the WLW method. Finally, using empirical data, the chapter illustrates how to diagnose the presence of autocorrelation and how to use one such corrective model.

Keywords:   survival models, time-to-event data, multilevel analysis, Cox proportional hazards model, autocorrelation

This chapter focuses on a relatively new class of survival models: the multilevel approaches to time-to-event data. In this chapter, I first review the importance of conducting multilevel analysis in social work research and the recent advances in biostatistics to correct for autocorrelated survival times. Next, I describe the independence assumption embedded in the Cox proportional hazards model and detail the negative consequences of including autocorrelated data. I then review available biomedical research models that correct for autocorrelation, particularly the WLW method. Finally, using empirical data, I illustrate how to diagnose the presence of autocorrelation and how to use one such corrective model. Most of the content presented in this chapter is based on Guo and Wells (2003).

Grouped Data and Significance of Conducting Multilevel Analysis

Social work researchers often encounter grouped or multilevel data in which individuals are nested within families, and families are nested within neighborhoods. Analyzing such data requires special treatment because most multivariate models assume that observations are independent, and grouped data clearly violate this assumption.

Sibling-group data are one example of a violation of the independence assumption. These types of data often include children from the same family who exit or reenter foster care at roughly the same time. Placement of siblings in the same home is mandated by many states or is the preference of many public agencies (Hegar, 1988; Smith, 1996).

A more important reason for conducting multilevel modeling is substantive: researchers need to test how individual characteristics interact with family characteristics, or how clients’ characteristics interact with agency characteristics, and therefore, to test joint effects of the two-level characteristics on the outcome variable. For instance, in child welfare research, an important research question is this: what is the joint impact of a child’s receipt of welfare and his or her mother’s receipt of welfare on foster care outcomes? How are agency performance outcomes related to local county and agency factors as well as individual factors of the child? This type of cross-level interaction, or propositions about macro-to-micro relations, cannot be answered by investigations using conventional approaches.

Statisticians and biomedical researchers identified adverse consequences of applying the Cox regression to grouped survival times (Andersen & Gill, 1982; Prentice, Williams, & Peterson, 1981). They noted that when the independence assumption of the Cox model is violated, the tests of statistical significance are biased in ways that cannot be predicted beforehand (Wei, Lin, & Weissfeld, 1989). Significant progress has been made toward a solution to the problem of nonindependent event times. Several approaches have been applied in biomedical research (Andersen & Gill, 1982; Lee, Wei, & Amato, 1992; Liang, Self, & Chang, 1993; Prentice et al., 1981; Wei et al., 1989).

Consequences of Using Autocorrelated Data

To detail the consequences of including autocorrelated data in the Cox proportional hazards model, it is useful to clarify a primary assumption underlying the method. The Cox regression, like all regression-type models, assumes that observations are independent; applying the model to data from the same unit (such as a person, a sibling group, a family, or an organization) violates this assumption, because data from the same unit tend to be more alike than data from independent units (Allison, 1995). Use of the Cox model without a procedure that corrects for autocorrelated data causes the estimation procedure to assume the sample contains more information than it actually has (Allison, 1995, p. 240).

The major consequence of including autocorrelated data in the Cox proportional hazards model is that the tests of statistical significance may be misleading. Studies using both real data and Monte Carlo experiments show that the standard errors produced by the Cox proportional hazards model are biased downward and that test statistics produced by the model are biased upward (Allison, 1995; Lin, 1994; Wei et al., 1989). As a result, the Cox model may identify some independent variables as statistically significant that are, in fact, statistically insignificant.
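The direction of this bias can be sketched with a toy calculation (hypothetical numbers, not the chapter's data): treating perfectly correlated duplicates as if they were independent observations shrinks the naive standard error even though no new information has been added.

```python
import math
import statistics

# Hypothetical event times (days in care); not the chapter's data.
values = [12.0, 8.5, 15.2, 9.9, 11.3, 14.1, 7.8, 10.6]
# Duplicate every value: a stand-in for perfectly correlated siblings.
duplicated = values * 2

def se_of_mean(xs):
    """Standard error of the mean, assuming independent observations."""
    return statistics.stdev(xs) / math.sqrt(len(xs))

naive_se = se_of_mean(duplicated)  # pretends 16 values are independent
honest_se = se_of_mean(values)     # uses the 8 truly independent values

# The naive SE is biased downward by roughly a factor of sqrt(2),
# because the duplicated sample holds no extra information.
print(naive_se < honest_se)              # True
print(round(honest_se / naive_se, 2))    # 1.46, close to sqrt(2)
```

Downward-biased standard errors inflate test statistics in exactly the way the studies cited above describe.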

Grouped data may also lead to informative censoring. For instance, when a foster care study uses a sample that contains a high proportion of children who are siblings, the noninformative censoring assumption is very likely to be violated. This is because children from the same sibling group tend to have the same length of time in foster care and a common outcome. Therefore, if one child in a sibling group is censored, his or her siblings are likely to be censored as well, so censoring is no longer independent across observations.

Diagnostic Procedures

Despite these consequences, no definitive tests have been developed to assess the severity of autocorrelation in a given study. Following the work of Allison (1995, chapter 8), I summarize below three procedures that may be used to diagnose its severity. In the following exposition, I take data with sibling groups as an illustrative example.

1. Assess the Scope of the Problem

The first strategy is to examine the study data to evaluate whether the presence of sibling-group data is a problem. Examination involves identifying the proportion of children with siblings in the sample. If the proportion is small, for example, 10% or less, the issue of autocorrelated data can be ignored. If the proportion is not small, for example, more than 10%, the researcher needs to examine the beginning and ending dates for the event under study; the ending date may be, for instance, the date on which a child leaves foster care and returns home. If the dates differ for siblings, the issue of autocorrelated data can be ignored. If the dates are frequently approximately the same, one needs to calculate the standard deviation of the length of time in care (or at home) within each sibling group. If the within-group standard deviations are small, on average, the sample’s intragroup correlation is likely to be high, and a corrective procedure should be used in the analysis.
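As a rough sketch of this first diagnostic, the following fragment computes the proportion of children with a sibling in the sample and the within-group standard deviations of time in care. The record layout and values are hypothetical, not the chapter's data.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (child_id, family_id, days_in_care).
records = [
    (1, "A", 210), (2, "A", 210),   # siblings, identical durations
    (3, "B", 95),  (4, "B", 98),    # siblings, nearly identical
    (5, "C", 310),                  # only child
    (6, "D", 40),  (7, "D", 40),
    (8, "E", 180),
]

by_family = defaultdict(list)
for _, fam, days in records:
    by_family[fam].append(days)

# Step 1: proportion of children who have a sibling in the sample.
n_with_sibs = sum(len(d) for d in by_family.values() if len(d) > 1)
prop_sibs = n_with_sibs / len(records)

# Step 2: within-group standard deviation of time in care; small SDs
# suggest high intragroup correlation.
within_sds = [statistics.stdev(d) for d in by_family.values() if len(d) > 1]

print(prop_sibs)        # 0.75, well above the 10% rule of thumb
print(max(within_sds))  # largest within-group SD; small here
```

With most within-group SDs at or near zero, this hypothetical sample would call for a corrective procedure.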

2. Assess the Intragroup Correlation

The second strategy is to assess directly the intra-sibling-group correlation by running hierarchical linear modeling, or HLM. (Before running the analysis, one needs to recode the event times of censored subjects, that is, subjects for whom the event under study had not occurred by the end of the study window, to the length of the study window.)1 The intragroup correlation is defined as the proportion of the variance in the transformed event times that is between groups (Raudenbush & Bryk, 2002). The intragroup correlation can be calculated by dividing the between-group variance by the sum of the between-group variance and the within-group variance. One can obtain the between-group and within-group variances by running a one-way ANOVA with random effects model in HLM. A high intragroup correlation, a correlation greater than .5, for example, is an indication that a considerable proportion of the variation in the timing of an event is due to groups and that a corrected Cox model should be used in the analysis.
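The variance decomposition behind the intragroup correlation can be sketched in a few lines. The balanced-group ANOVA estimator and the event times below are illustrative assumptions, not the HLM output itself.

```python
import statistics

# Hypothetical event times (days), censored times already recoded to
# the study-window length; two siblings per group (balanced design).
groups = [
    [540, 540], [120, 125], [365, 360], [540, 540], [60, 62],
]
m = 2  # siblings per group

group_means = [statistics.mean(g) for g in groups]
# Within-group variance: pooled sample variance inside groups.
within = statistics.mean([statistics.variance(g) for g in groups])
# Between-group variance (one-way random-effects ANOVA estimator):
# variance of the group means minus the share due to within-group noise.
between = statistics.variance(group_means) - within / m

icc = between / (between + within)
print(icc)  # close to 1: timing is driven almost entirely by the group
```

An ICC this high (well above the .5 benchmark) would indicate that a corrected Cox model is needed.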

3. Assess the Effect of an Omitted Sibling’s Length of Time in Care

The third strategy is to use the event time of a sibling omitted from each sibling group as one of the predictors in the Cox model. The model should also include any covariates one would otherwise include in the model because the question is whether there is residual autocorrelation after the effects of the covariates have been removed. This analysis is performed on a subset of the sample, that is, on only those who have a sibling in the sample. A significant coefficient for the event time of omitted siblings would indicate a high degree of intragroup correlation among the sibling groups, and a corrected model should be used in the analysis.
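A minimal sketch of the data step for this third diagnostic follows; the record layout and values are hypothetical, and the Cox fit itself is not shown.

```python
import random
from collections import defaultdict

# Hypothetical records: (child_id, family_id, days_in_care).
records = [
    (1, "A", 210), (2, "A", 212),
    (3, "B", 95),  (4, "B", 98), (5, "B", 97),
]

random.seed(42)
by_family = defaultdict(list)
for rec in records:
    by_family[rec[1]].append(rec)

analysis_rows = []
for fam, members in by_family.items():
    omitted = random.choice(members)  # one sibling omitted per group
    for child_id, _, days in members:
        if child_id != omitted[0]:
            # 'omitted_duration' becomes a covariate in the Cox model;
            # a significant coefficient signals residual autocorrelation.
            analysis_rows.append(
                {"child_id": child_id, "days": days,
                 "omitted_duration": omitted[2]}
            )

print(len(analysis_rows))  # 3: one fewer row per family than the input
```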

Overview of the Multilevel Approaches

A variety of models have been developed to correct for the problems associated with autocorrelation. Terry M. Therneau and Patricia M. Grambsch (2000) and Philip Hougaard (2000) provide comprehensive evaluations of recent developments, the types of data for which each is best suited, and the computer software that is available for each one.

There are two types of autocorrelated multivariate event times: clustered event data, in which event times are correlated among subjects from the same group, and repeated event data, in which the same event occurs more than once to the same subject.

Models to correct for autocorrelation induced by clustered event data fall into two broad categories: the frailty models and the marginal models. The frailty models include random effects to represent extra heterogeneity of the unit that gives rise to the dependence of event times (Clayton & Cuzick, 1985; Guo & Rodriguez, 1992; Hougaard, 1986, 1987; Klein, 1992; Nielsen, Gill, Andersen, & Sorensen, 1992; Oakes, 1989, 1992). In order to use frailty models, one must specify correctly a parametric distribution of the frailty (Lin, 1994), which is often unknown to researchers. The frailty models are best suited to clinical trials involving random selection of subjects or to samples involving matched-pairs covariates (Hougaard, 2000). In this text, I focus on marginal models because frailty models require stronger and more restrictive assumptions.

Two models have been developed to handle repeated-event data: (a) the AG multiplicative intensity model (Andersen & Gill, 1982), under which the risk of a recurrent event for a subject satisfies the usual proportional hazards model and is unaffected by earlier events that occurred to the subject unless terms that capture such dependence are included explicitly in the model as covariates (Lin, 1994); and (b) the PWP model (Prentice et al., 1981), which differs from the AG model in two respects: the risk sets for the (k+1)th recurrence are restricted to the individuals who have experienced the first k recurrences, and the underlying intensity functions and regression parameters are allowed to vary among distinct recurrences (Wei et al., 1989).

Several computer software programs are available for fitting both frailty and marginal models. The statistical software package S-Plus offers an array of functions for fitting both frailty and marginal models. Several SAS macros, along with most of the data sets discussed in Therneau and Grambsch (2000), can be found in the companion Web page of their book. Several SAS macros developed by Allison can be downloaded from the SAS Web page. The SAS Proc Phreg and Stata stcox procedures offer the estimation of marginal approaches.

The Marginal Models

The marginal models have much in common with the generalized estimating equation (GEE) approach of Zeger, Liang, and Albert (1988). Two marginal models are especially useful to the multilevel analysis of survival data: the WLW model (Wei et al., 1989), and the LWA model (Lee et al., 1992). These models are designated by the first initials of their developers’ last names.2

The marginal models offer several advantages for multilevel analysis. They are flexible in that they do not require assumptions about the nature or structure of the dependence in correlated event times (Allison, 1995; Wei et al., 1989). They are applicable to moderate-sized samples (Wei et al., 1989).3 They are consistent with the conventional Cox regression in the sense that the conventional model is a special case of the marginal models (Hougaard, 2000). The computation of a marginal model is relatively simple once a data set has been created (Therneau & Grambsch, 2000).

In general, both the WLW and the LWA models make no more assumptions about the data than does the conventional Cox regression. The fundamental difference between the WLW and LWA models is the way each handles the baseline hazard function (i.e., h0(t) in equation (4.1)). The WLW model allows the baseline hazard function to vary among types of multivariate event times and allows type-specific regression parameters. By way of contrast, the LWA model postulates a common baseline hazard function for all types of event times (Lin, 1994).

Both models can be employed to facilitate multilevel analysis. The WLW model can be used whether subjects within groups have common or divergent baseline hazard rates whereas the LWA model can be used only when they have common baseline hazard rates. I now use the WLW model as one example to illustrate the methodology of correcting autocorrelated event times.

To correct for biases in standard errors and to estimate parameters, the WLW procedure runs a series of Cox regression models. To run these models, the subjects within groups must be organized randomly. The procedure requires that the investigator identify the subjects to be analyzed in the first and succeeding models. Taking the sibling group data as an example, the first model is for the first child selected from all sibling groups, the second is for the second child from all groups, and so on. The estimating procedure continues in this fashion until the number of children gets too small to estimate a model reliably. Based on the estimated variances, the WLW procedure then estimates the marginal distributions of the distinct event times in order to yield a robust and optimal estimation of the variance-covariance matrix. This variance-covariance matrix is then used in statistical testing. Standard errors from this matrix are usually larger than those estimated by the uncorrected model. Therefore, when autocorrelation is present, variables that are significant in the uncorrected Cox model may become insignificant in the Cox model corrected with the WLW procedure.4
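The data organization behind this series of models can be sketched as follows; the group labels and random ordering are illustrative, and the Cox fits themselves are omitted.

```python
import random
from collections import defaultdict

# Hypothetical (family_id, child_id) pairs: two sibling pairs and one
# only child.
records = [("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5)]

random.seed(1)
by_group = defaultdict(list)
for fam, child in records:
    by_group[fam].append(child)
for members in by_group.values():
    random.shuffle(members)  # subjects within groups organized randomly

# Stratum k collects the (k+1)-th child from every group that has one;
# model k+1 is then fit to that stratum.
max_size = max(len(m) for m in by_group.values())
strata = [[m[k] for m in by_group.values() if len(m) > k]
          for k in range(max_size)]

print([len(s) for s in strata])  # [3, 2]: model 1 uses 3 children,
                                 # model 2 uses the 2 second children
```

The per-stratum estimates would then be combined into the robust variance-covariance matrix described above.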

Since the WLW model is designed to correct for biases in standard errors, the estimated coefficients from this procedure are not expected to differ in size from those produced by the uncorrected Cox model when both models are constructed with data from the same subjects (Allison, 1995). (Any differences that do appear can generally be ignored.) One limitation of the WLW procedure is that some subjects in groups will be excluded from the analysis when the distribution of group size is skewed.

To understand this limitation, suppose we analyze a sample of 40 children in three types of groups: (1) 10 groups of size one (i.e., single children without siblings), (2) 10 groups of size two (i.e., each group comprises two children from the same family), and (3) one group of size 10 (i.e., all 10 children coming from one family). The WLW procedure will analyze all children from groups of size one or two but only two children from the group of size 10. This is because when the WLW runs the third model for the third child from each sibling group, there is only one child left, that is, the third child from the group of size 10. The WLW macro will stop at this point because the sample size becomes too small to run a model reliably. In this example, the WLW procedure has to delete eight children, or 20% of the sample, from the study.
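The arithmetic of this example can be checked directly. The minimum stratum size used as the stopping rule below is an assumption, since the macro's exact rule is not stated here.

```python
# Group sizes from the example above: 10 groups of one, 10 groups of
# two, and one group of ten.
group_sizes = [1] * 10 + [2] * 10 + [10]
total = sum(group_sizes)  # 40 children

MIN_STRATUM = 2  # assumed minimum children needed to fit a model
retained = 0
k = 0
while True:
    stratum = sum(1 for g in group_sizes if g > k)  # (k+1)-th children
    if stratum < MIN_STRATUM:
        break  # too few children left to fit another model
    retained += stratum
    k += 1

excluded = total - retained
print(retained, excluded, excluded / total)  # 32 8 0.2: 20% deleted
```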

An Illustrative Example

In this illustration, we demonstrate how to use the diagnostic procedures described above. We also show the deficiencies of results from an uncorrected Cox model and results from a Cox model corrected by including a randomly selected child from each sibling group. We use data from an investigation designed to identify which of 12 factors are linked most strongly to timing of reunification of foster children within 18 months of entry into care (Wells & Guo, 2003). The study sample includes 525 children first placed in foster care over a 6-month period in the late 1990s.

1. Diagnosis of an Autocorrelation Problem

To illustrate how to diagnose the presence of autocorrelation in study data, we constructed a data set in which 58% of study children (n = 302) form 151 sibling groups each containing two siblings. The remaining children (n = 223) do not have a sibling in the sample. As a result, the distribution of the sibling group size, in this illustration, is not skewed. Our examination of the dates of entry into foster care and exit from foster care for subjects shows that children from the same sibling group often have the same dates of entry and exit. As a result, the length of time siblings spend in foster care is often the same. Of the 151 sibling groups, 116 or 76.8% have a standard deviation of zero for time spent in foster care prior to exit.

Then, we examined the degree of correlation among siblings’ event data. We changed the lengths of stay for children who were censored at the end of the 18-month study window to 18 months, and then we used these transformed data to run a one-way ANOVA with random effects model in HLM. The between-group variance of length of stay is 33059.49, and the within-group variance of length of stay is 5159.63. Together they yield an estimated intragroup correlation of 0.865. Therefore, the intragroup correlation in this data set is high: about 86.5% of the variation in length of stay in this sample is due to groups, and only 13.5% is due to individuals.
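The reported intragroup correlation follows directly from these two variance components:

```python
# Variance components reported above (one-way random-effects ANOVA).
between = 33059.49  # between-group variance of length of stay
within = 5159.63    # within-group variance of length of stay

# ICC = between-group variance / total variance.
icc = between / (between + within)
print(round(icc, 3))  # 0.865
```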

We also used the event time of a sibling who was omitted from the analysis as one of the predictors of timing of reunification. Following Allison’s suggestion (1995), we ran an uncorrected Cox model using 151 children in the sample. The other predictor variables used in the model include child age at entry, gender, ethnicity, health status at entry, reason for a child’s placement, the type of placement, mother’s mental health status, mother’s problem with substance abuse, and other variables for the mother’s welfare use and employment. The coefficient for “Duration of an omitted sibling” is –0.012 (p<.0001), and the hazard ratio is exp(–0.012) = 0.988. This means that the omitted sibling’s length of stay is highly predictive of the length of stay of the child in the omitted sibling’s sibling group. The correlation is almost one: the hazard of reunification for any child included in the analysis is almost the same as the hazard of his or her omitted sibling, other things being equal.
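The reported hazard ratio is simply the exponentiated Cox coefficient:

```python
import math

# Cox coefficient for "Duration of an omitted sibling" reported above.
coef = -0.012
hazard_ratio = math.exp(coef)
print(round(hazard_ratio, 3))  # 0.988
```

Each additional day of the omitted sibling's stay multiplies the included child's reunification hazard by about 0.988, other things being equal.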

As a result of these diagnostic procedures, one can conclude that the event data from sibling groups are highly correlated, that the assumption of the Cox proportional hazards model is violated, and that a correction for the biases introduced by these violations is necessary to have an efficient and a valid test of the statistical significance of each factor included in the model.

2. Demonstration of the Utility of the WLW Model

To demonstrate the utility of the WLW model, we conducted two comparative analyses. In the first analysis, we compared the results of the corrected Cox model (the WLW model) with those of the uncorrected Cox model (“Naïve” model 1). In the second comparison, we compared the results of the corrected Cox model (the WLW model) with those of a second naïve model, the random selection of one subject from a group to include in the analysis (Naïve model 2).

We used the SAS WLW macro written by Allison (1995, pp. 242–243) to estimate the WLW model. This macro can be run on SAS release 6.10 or higher and requires use of the Proc IML module. We used the SAS Proc Phreg to estimate both of the naïve models.

The WLW Model and the Naïve Model 1. In the first comparison, both analyses use the same 525 subjects. As the results in Table 6.1 show, most standard errors estimated by the Naïve model 1 are lower than the corresponding errors estimated by the WLW Model. As a result, the p-values in the WLW model have also changed. Naïve model 1 identifies one variable, “Received TANF income and lost TANF income,” as statistically significant at the .001 level, and two variables, “Foster home” and “Percent time mother received TANF-pre-placement window,” as statistically significant at the .05 level. Each of these variables is statistically insignificant at the same level in the WLW model. On the other hand, Naïve model 1 identifies the variable “Group home or hospital” as statistically insignificant, and the WLW model identifies the variable as statistically significant at the .001 level.

Table 6.1 also shows that at an alpha level of .05, the naïve model identifies 8 of 19 variables as statistically significant, but the WLW Model identifies as statistically significant 6 of those 8 variables, plus one variable that is statistically insignificant in the Naïve model 1.

The WLW Model and the Naïve Model 2. In the second analysis, we compared the results from the WLW model with those from the second uncorrected Cox model (the Naïve model 2). These models use different numbers of subjects: for the corrected model, 525 subjects, and for the uncorrected model, 374 subjects, as a result of the inclusion of only one randomly selected child from each sibling group.

As data in Table 6.1 show, the Naïve model 2 identifies as statistically insignificant one variable, “Group home or hospital,” that is statistically significant at the .001 level in the WLW model. The Naïve model 2 identifies two variables as statistically insignificant, “Age 0” and “Mother mental problems: Presence,” that are statistically significant at the .05 level in the WLW model. Table 6.1 also shows that the hazard ratios in the two models differ by 5 percentage points or more on the following eight variables: “Age 1–3,” “Age 12–16,” “Female,” “Health problems at entry: Presence,” “Dependency,” “Reason for placement: Other,” “Group home or hospital,” and “Mother’s substance abuse: Presence.”


Table 6.1 Model Comparisons

Columns (each reporting hazard ratios): Naïve Model 1, same set of subjects as the WLW model (n=525); Naïve Model 2, one child randomly selected from each sibling group (n=374); Robust Model WLW (n=525).

Row variables: Age at entry (in years); African American; Health problems at entry; Reason for placement (Physical abuse; …); First placement type (Kinship home; Foster home; Group home or hospital); Mother mental problems; Mother substance abuse; Percent time mother received TANF - pre-placement window; Mother's receipt/loss TANF/wages - post-placement window (Received TANF, no loss; Never received TANF income; Received TANF income, and lost TANF income); Mother's average monthly total income from TANF/wages - post-placement window; Mother's percent average monthly total income from wages - post-placement window.

[Hazard-ratio entries are not recoverable in this copy.]

Note: Comparison categories are listed first for each categorical variable. (***) Significant at .001 level, (**) at .01 level, (*) at .05 level, two-tailed test.


In summary, these comparisons show substantial differences between the WLW model and the Naïve model 1 and between the WLW model and the Naïve model 2 with respect to tests of statistical significance. The WLW model also differs from the Naïve model 2 with respect to the magnitude of the hazard ratios. Moreover, the factors that affect timing of reunification to a statistically significant degree in the WLW model differ from those identified by the two naïve models. These comparisons reveal that applying uncorrected models to grouped survival data produces misleading results, and they underscore the importance of controlling for autocorrelation explicitly in a multilevel analysis of time-to-event data.

In this chapter, I briefly review the statistical literature on developing multilevel approaches to analyzing time-to-event data and show the application of one such model (i.e., the WLW model) to correct for autocorrelation introduced by sibling groups. I caution, however, that use of the WLW model may not always be an improvement over an uncorrected model. When a model is misspecified, that is, when important explanatory variables are omitted, and the analysis deletes too many subjects, differences between a WLW model and an uncorrected Cox model disappear. Misspecification causes changes in the estimated parameters and standard errors as a function of unobserved heterogeneity (Allison, 1995).

Indeed, use of the WLW model may be inappropriate if too many subjects are excluded from the analysis because the sample that is included in the analysis no longer represents the entire sample of interest. It is uncertain, however, how large a reduction renders use of the WLW model invalid. If loss of subjects is a concern, investigators may use the LWA procedure to correct for autocorrelation when they can justify its use.


(1) The issue of intragroup correlation is more complicated in survival analysis than in other longitudinal studies because of censoring. The method of assigning the maximum length of stay to subjects whose lengths of stay exceed the study window, as we recommend here, is an informal procedure.

(2) Liang et al. (1993) developed another marginal model, the “LSC” model, which is basically the same as the LWA model (Lin, 1994). It employs the same estimating function as that of the LWA, but replaces the ratio of survival functions in estimating the “score function” by an analog that exploits pairwise comparisons of independent observations.

(3) Wei et al. (1989) employ a sample of 36 patients with approximately three distinct event times for each patient.

(4) Lin’s Monte Carlo simulation shows that the LWA-type model tends to be more efficient than the WLW-type model when failure times are generated with a common baseline hazard function, but the difference is very small. His study using real clinical data shows that the LSC model produces parameter estimates very similar to those of the WLW model, and the standard error estimates are almost identical between the two (Lin, 1994).