## Judith D. Singer and John B. Willett

Print publication date: 2003

Print ISBN-13: 9780195152968

Published to Oxford Scholarship Online: September 2009

DOI: 10.1093/acprof:oso/9780195152968.001.0001

# Describing Discrete-Time Event Occurrence Data

Chapter: 10 Describing Discrete-Time Event Occurrence Data
Source: Applied Longitudinal Data Analysis
Publisher: Oxford University Press
DOI: 10.1093/acprof:oso/9780195152968.003.0010

# Abstract and Keywords

This chapter presents a framework for describing discrete-time event occurrence data. Section 10.1 introduces the life table, the primary tool for describing event occurrence data. Section 10.2 introduces three essential statistical summaries of the life table—the hazard function, the survivor function, and the median lifetime—and demonstrates that these ingenious statistics, which deal evenhandedly with censored and noncensored cases, are intuitively appealing as well. Section 10.3 applies the new techniques to four empirical studies, to help you develop intuition about the behavior, interpretation, and interrelationships of life table methods. Section 10.4 focuses on sampling variation, showing how to estimate standard errors. Section 10.5 concludes by showing how to compute all these summaries using standard cross-tabulation programs available in every major statistical package.

Time does not change us, it just unfolds us.

—Max Frisch

Most empirical researchers are so comfortable conducting descriptive analyses that they rarely imagine that familiar statistical workhorses—such as means and standard deviations—may not always be suitable. As explained in chapter 9, censoring makes standard statistical tools inappropriate even for simple analyses of event occurrence data. A censored event time provides only partial information: it tells you only that the individual did not experience the target event by the time of censoring. In essence, it tells you more about event nonoccurrence than about event occurrence (the latter, of course, being your primary interest). Traditional statistical methods provide no ready way of simultaneously analyzing observed and censored event times. Survival methods do.

In this chapter, we present a framework for describing discrete-time event occurrence data. In addition to our primary agenda—which involves demonstrating how to implement the new methods and interpret their results—we have a secondary agenda: to lay the foundation for model building, the focus of chapters 11 and 12. As we will show, the conceptual linchpin for all subsequent survival methods is to approach the analysis on a period-by-period basis. This allows you to examine event occurrence sequentially among those individuals eligible to experience the event at each discrete point in time.

We begin in section 10.1 by introducing the life table, the primary tool for describing event occurrence data. Then, in section 10.2, we introduce three essential statistical summaries of the life table—the hazard function, the survivor function, and the median lifetime—and demonstrate that these ingenious statistics, which deal evenhandedly with censored and noncensored cases, are intuitively appealing as well. In section 10.3, we apply the new techniques to four empirical studies, to help you develop intuition about the behavior, interpretation, and interrelationships of life table methods. Sampling variation is the topic of section 10.4, where we show how to estimate standard errors. We conclude, in section 10.5, by showing you how to compute all these summaries using standard cross-tabulation programs available in every major statistical package.

# 10.1 The Life Table

The fundamental tool for summarizing the sample distribution of event occurrence is the life table. As befits its name, a life table tracks the event histories (the “lives”) of a sample of individuals from the beginning of time (when no one has yet experienced the target event) through the end of data collection. Table 10.1 presents a life table for the special educator data set introduced in section 9.1.2. Recall that this study tracked the careers of 3941 teachers newly hired in the Michigan public schools between 1972 and 1978. Everyone was followed until 1985, when data collection ended. Defining the “beginning of time” as the teacher’s date of hire, research interest centers on whether and, if so, when these special educators stopped teaching.

Divided into a series of rows indexing time intervals (identified in columns 1 and 2), a life table includes information on the number of individuals who:

• Entered the interval (column 3)

• Experienced the target event during the interval (column 4)

• Were censored at the end of the interval (column 5)

Here, these columns tally, for each year of the career, the number of teachers employed at the beginning of the year, the number who stopped teaching during the year, and the number who were censored at the end of the year (that is, who were still teaching when data collection ended). We discuss the remaining elements of the life table in section 10.2.

Taken together, these columns provide a narrative history of event occurrence over time. At the "beginning of time," when everyone was hired, all 3941 teachers were employed. During the first year, 456 teachers quit, leaving 3485 (3941 – 456) to enter the next interval, year 2. During the second year, 384 teachers quit, leaving 3101 (3485 – 384) to enter the next interval, year 3. During the 7th year, censoring begins to affect the narrative. Of the 2045 special educators who entered their seventh year, 123 quit by the end of that year and 280 were censored.

Table 10.1: Life table describing the number of years in teaching for a sample of 3941 special educators

| Year | Time interval | Risk set (employed at the beginning of the year) | Number who left during the year | Number censored at the end of the year | Hazard function | Survivor function |
|---|---|---|---|---|---|---|
| 0 | [0, 1) | 3941 | | | | 1.0000 |
| 1 | [1, 2) | 3941 | 456 | 0 | 0.1157 | 0.8843 |
| 2 | [2, 3) | 3485 | 384 | 0 | 0.1102 | 0.7869 |
| 3 | [3, 4) | 3101 | 359 | 0 | 0.1158 | 0.6958 |
| 4 | [4, 5) | 2742 | 295 | 0 | 0.1076 | 0.6209 |
| 5 | [5, 6) | 2447 | 218 | 0 | 0.0891 | 0.5656 |
| 6 | [6, 7) | 2229 | 184 | 0 | 0.0825 | 0.5189 |
| 7 | [7, 8) | 2045 | 123 | 280 | 0.0601 | *0.4877* |
| 8 | [8, 9) | 1642 | 79 | 307 | 0.0481 | *0.4642* |
| 9 | [9, 10) | 1256 | 53 | 255 | 0.0422 | *0.4446* |
| 10 | [10, 11) | 948 | 35 | 265 | 0.0369 | *0.4282* |
| 11 | [11, 12) | 648 | 16 | 241 | 0.0247 | *0.4123 → 0.4177* |
| 12 | [12, 13) | 391 | 5 | 386 | 0.0128 | *0.4123* |

*Note*: The hazard function is the proportion of teachers at the beginning of the year who left during the year; the survivor function is the proportion of all teachers still employed at the end of the year. Survivor estimates shown in italics (years 7 through 12) are computed indirectly, as described in section 10.2.2.

This left only 1642 teachers (2045 – 123 – 280) to enter their 8th year and of these, 79 quit by the end of that year and 307 were censored. In the later rows of the life table, censoring exacts a heavy toll on our knowledge about event occurrence. Among the 648 special educators still teaching at the beginning of year 11, for example, only 16 quit by the end of the year but 241 (15 times as many) were censored. All told, this life table describes the event histories for 24,875 "person-years": 3941 year 1's, 3485 year 2's, up through 391 year 12's.

Like all life tables, table 10.1 divides continuous time into a series of contiguous intervals. In column 1, we label these intervals using ordinal numbers; in column 2, we define precisely which event times appear in each. The intervals in table 10.1 reflect a standard partition of time, in which each interval includes the initial time and excludes the concluding time. Adopting common mathematical notation, [brackets] denote inclusions and (parentheses) denote exclusions. Thus, we bracket each interval’s initial time and place a parenthesis around its concluding time, writing these 13 intervals as [0, 1), [1, 2), [2, 3),…, [12, 13). Teachers are hired at time 0, so the 0th interval refers to that period of time between contract signing and the first day of school, a period when no event can occur. Each subsequent interval—labeled 1 through 12—refers to a specific year of the career. We define a year as that period of time between the first day of school in the fall and the end of the associated summer. The first day of school for the following academic year falls into the next interval. Under this partition, any event occurring between the first day of year 1 up to (but excluding) the first day of year 2 is classified as occurring during year 1. Any event occurring during the second year—be it on the first day of school or the last day of summer—is classified as occurring during year 2.

When devising your own life tables, you should select the temporal partition most relevant for your chosen time metric and for the way in which events unfold. More generally, we represent any arbitrary division of time by using the letter t to denote time and the subscript j to index time periods. We then write a series of general time intervals as [t0, t1), [t1, t2), …, [tj−1, tj), [tj, tj+1), and so on. Any event occurring at t1 or later but before t2 is classified as happening during the first time interval [t1, t2). The jth time interval, written as [tj, tj+1), begins immediately at time tj and ends just before time tj+1. No events can occur during the 0th interval, which begins at time 0 and ends just before t1, the first observable event time. Conceptually, this interval represents the "beginning of time."

The next column of the life table displays information on the number of individuals who enter each successive time period. Statisticians use the term risk set to refer to this pool: those eligible to experience the event during that interval. For intervals in which no one is censored—here, the early years of the career—identification of the risk set is straightforward. Each year’s risk set is just the prior year’s risk set minus those individuals who experienced the event during the prior year. The year 4 risk set (2742), for example, is just the year 3 risk set (3101) diminished by the 359 teachers who quit during their third year. In those intervals when censoring occurs—in our example, during the later years of the career—the risk set declines because of both event occurrence and censoring. The year 9 risk set (1256), for example, is just the year 8 risk set (1642) diminished by the 79 teachers who quit during year 8 and the 307 teachers who were censored at the end of year 8.
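The risk-set arithmetic just described is easy to verify. The sketch below (plain Python; the event and censoring counts are taken from Table 10.1) reconstructs each year's risk set by subtracting the prior year's events and censored cases:

```python
# Each year's risk set = prior year's risk set,
#   minus those who left during the prior year,
#   minus those censored at the end of the prior year.
left     = [456, 384, 359, 295, 218, 184, 123, 79, 53, 35, 16, 5]     # years 1-12
censored = [0,   0,   0,   0,   0,   0,   280, 307, 255, 265, 241, 386]

risk_sets = [3941]                      # everyone is at risk in year 1
for n_left, n_cens in zip(left[:-1], censored[:-1]):
    risk_sets.append(risk_sets[-1] - n_left - n_cens)

print(risk_sets)
# [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]
print(sum(risk_sets))   # 24875 person-years described by the life table
```

Summing the reconstructed risk-set column also recovers the total number of person-years the life table describes.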

An essential feature of the risk set’s definition is that it is inherently irreversible: once an individual experiences the event (or is censored) in one time period, he or she drops out of the risk set in all future time periods. Irreversibility is crucial, for it ensures that everyone remains in the risk set only up to, and including, the last moment of eligibility. The risk set for year 9, for example, comprises only those individuals (1256 of the original sample of 3941) who taught continuously through eight full years and so entered a ninth. Individuals who left or were censored in a previous year are not “at risk” of leaving in year 9 and are therefore excluded from the risk set in this period and all subsequent periods.

Why is the concept of a risk set important? If censoring is non-informative (as described in section 9.3), we can assume that each interval’s risk set is representative of all individuals who would have been at risk of event occurrence in that interval had everyone been followed for as long as necessary to eliminate all censoring (that could be eliminated). To understand the implications of this assumption, consider the risk sets in years 7 (2045) and 8 (1642). If censoring is noninformative, the 1642 teachers in the year 8 risk set are representative of that subset of the 2045 teachers who would have entered their eighth year if we could have observed them through that point in time (that is, were there no censoring). For this to be true, the 280 teachers censored at the end of year 7, who could not be observed in year 8, must be no different from the 1642 teachers who were observed in year 8. Under this assumption, we can generalize the behavior of the 1642 people in the year 8 risk set back to the entire population of teachers who would have entered their eighth year. This allows us to analyze event occurrence among the members of each year’s risk set yet generalize results back to the entire population.

# 10.2 A Framework for Characterizing the Distribution of Discrete-Time Event Occurrence Data

Having described how the life table tallies data about event occurrence over time, we now introduce three invaluable statistical summaries of this information: the hazard function, the survivor function, and the median lifetime.

## 10.2.1 Hazard Function

The fundamental quantity used to assess the risk of event occurrence in each discrete time period is known as hazard. Denoted by h(tij), discrete-time hazard is the conditional probability that individual i will experience the event in time period j, given that he or she did not experience it in any earlier time period.1 Because hazard represents the risk of event occurrence in each discrete time period among those people eligible to experience the event (those in the risk set), hazard tells us precisely what we want to know: whether and when events occur.

We can formalize this definition by adopting some notation. Let T represent a discrete random variable whose values Ti indicate the time period j when individual i experiences the target event. For a teacher who leaves in year 1, Ti = 1; for a teacher who leaves in year 8, Ti = 8. Methodologists typically characterize the distribution of a random variable like T by describing its probability density function, the probability that individual i will experience the event in time period j, Pr[Ti = j], or its cumulative density function, the probability that individual i will experience the event before time period j, Pr[Ti < j]. But because event occurrence is inherently conditional—an event can occur only if it has not already occurred—we characterize T by its conditional probability density function: the distribution of the probability that individual i will experience the event in time period j given that he or she did not experience it at any time prior to j. This is algebraically equivalent to the probability that the event will occur in the current time period, given that it must occur now, or sometime in the future:2

(10.1)  h(tij) = Pr[Ti = j | Ti ≥ j]

The set of discrete-time hazard probabilities expressed as a function of time—labeled h(tij)—is known as the population discrete-time hazard function.

We cannot overemphasize the importance of the conditionality inherent in the definition of hazard. Individual i can experience the event in time period j if, and only if, he or she did not already experience it in any prior period. Conditionality ensures that hazard represents the probability of event occurrence among those individuals eligible to experience the event in that period—those in the risk set. As people experience events, they drop out of the risk set and are ineligible to experience the event in a later period. Because of this conditionality, the hazard probability for individual i in time period j assesses his or her unique risk of event occurrence in that period.

Notice that each individual in the population has his or her own discrete-time hazard function. This is similar to the way we specified individual growth models, by allowing each person to have his or her own true growth trajectory. Here, we specify that each individual, whom we ultimately distinguish from other members of the population on the basis of predictors (e.g., gender and subject specialty), has a hazard function that describes his or her true risk of event occurrence over time. In chapter 11, when we develop statistical models for predicting discrete-time hazard, we specify the relationship between parameters characterizing each person’s hazard function and predictors. For now, because we are simply describing the distribution of event occurrence for a random sample of individuals from a homogeneous population among whom we are not (yet) distinguishing, we drop the subscript i (that indexes individuals) and write the discrete-time hazard function for a random individual in this population as h(tj).

Although this definition of hazard may appear far removed from sample data, examination of column 6 of the life table reveals that it is a commonsense summary of event occurrence. Column 6 presents the proportion of teachers teaching at the beginning of each year who left by the end of the year. Phrased more generally, it presents the proportion of each interval’s risk set that experiences the event during that interval. Among these 3941 special educators, .1157 (n = 456) left by the end of their first year. Of the 3485 who stayed more than one year, .1102 (n = 384) left by the end of their second. Notice that these proportions, just like the definition of hazard, are conditional. Each represents the fraction of that year’s risk set that leaves that year. This allows the proportions to be computed easily in every year, regardless of censoring. Among the 2045 teachers who taught continuously for six years, for example, .0601 (n = 123) left by the end of their seventh; of the 948 who taught continuously for 9 years, .0369 (n = 35) left at the end of their tenth.

What is the relationship between the population definition of discrete-time hazard in equation 10.1 and these sample proportions? Quite simply, these proportions are maximum likelihood estimates of the discrete-time hazard function (Singer & Willett, 1993). They are also the discrete limit of the well-known Kaplan-Meier estimates of hazard for continuous-time data (Efron, 1988). More formally, if we let n eventsj represent the number of individuals who experience the target event in time period j and n at riskj represent the number of individuals at risk during time period j, we estimate the value of discrete-time hazard in time period j as:

(10.2)  ĥ(tj) = n eventsj / n at riskj

Thus, we estimate ĥ(t1) to be .1157, ĥ(t2) to be .1102, and so on. Because no one is eligible to experience the target event during the initial time interval, here [0, 1), h(t0) is undefined.
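Equation 10.2 amounts to one division per row of the life table. A minimal sketch in Python (counts taken from Table 10.1) reproduces the estimated hazard function:

```python
# Estimated discrete-time hazard (equation 10.2):
#   h-hat(t_j) = (number of events in period j) / (number at risk in period j)
at_risk = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]
events  = [456,  384,  359,  295,  218,  184,  123,  79,   53,   35,  16,  5]

hazard = [e / n for e, n in zip(events, at_risk)]
print([round(h, 4) for h in hazard])
# [0.1157, 0.1102, 0.1158, 0.1076, 0.0891, 0.0825,
#  0.0601, 0.0481, 0.0422, 0.0369, 0.0247, 0.0128]
```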

The magnitude of hazard in each time interval indicates the risk of event occurrence in that interval. When examining estimated values of discrete-time hazard, remember that:

• As a probability, discrete-time hazard always lies between 0 and 1.

• Within these limits, hazard can vary widely. The greater the hazard, the greater the risk; the lower the hazard, the lower the risk.

Examining the estimated hazard function for the special educator data displayed in table 10.1, we see that in the first four years of teaching, hazard is consistently high, exceeding .10. This indicates that over 10% of the teachers still teaching at the beginning of each of these years leave by the end of the year. After these initial “hazardous” years, the risk of leaving declines steadily over time. From year 8 on, hazard never exceeds 5%, and by years 11 and 12, it is just barely above 0.

A valuable way of examining the estimated discrete-time hazard function is to graph its values over time. The top panel of figure 10.1 presents the kind of plot we consider most useful, and we often display such a plot in lieu of tabling estimated hazard probabilities. Although some methodologists present discrete-time hazard functions as a series of lines joined together as a step function, we follow the suggestions of Miller (1981) and Lee (1992) and plot the discrete-time hazard probabilities as a series of points joined together by line segments. Plots like these can help you to:

• Identify especially risky time periods—when the event is particularly likely to occur

• Characterize the shape of the hazard function—determining whether risk increases, decreases, or remains constant over time

In this study, like many other studies of employee turnover, the estimated hazard function peaks in the first few years and declines thereafter.

Figure 10.1. Estimated sample hazard and survivor functions for the 3941 special educators (with estimated median lifetime in parentheses on the plot of the sample survivor function).

Novice special educators, or those with only a few years of experience, are at greatest risk of leaving teaching. Once they gain experience (or perhaps, tenure), the risk of leaving declines. We present additional examples of estimated hazard functions in section 10.3, after we finish introducing the remaining elements of the life table.

## 10.2.2 Survivor Function

The survivor function provides another way of describing the distribution of event occurrence over time. Unlike the hazard function, which assesses the unique risk associated with each time period, the survivor function cumulates these period-by-period risks of event occurrence (or more properly, nonoccurrence) together to assess the probability that a randomly selected individual will “survive”—will not experience the event. Formally denoted by S(tij), the survival probability is defined as the probability that individual i will survive past time period j. For this to happen, individual i must not experience the target event in the jth time period or in any earlier period. In terms of T, the random variable for time, this implies that teacher i will still be teaching at the end of year j; in other words, Ti exceeds j. We therefore write the survival probability for individual i in time period j as:

(10.3)  S(tij) = Pr[Ti > j]

We refer to the set of survival probabilities expressed as a function of time—S(tij)—as that individual’s survivor function. As before, when we do not distinguish people on the basis of predictors, we write the survivor function for a random member of the population without the subscript i as S(tj).

How does a survivor function behave over time? At the beginning of time, when no one has yet experienced the event, everyone is surviving, and so by definition, its value is 1. Over time, as events occur, the survivor function declines toward 0 (its lower bound). In those time periods when hazard is high, the survivor function drops rapidly. In those time periods when hazard is low, it declines slowly. But unlike the hazard function, which can increase, decrease, or remain the same between adjacent intervals, the survivor function will never increase. When passing through time periods when no events occur, the survivor function simply remains steady at its previous level.

There are two ways of using sample data to compute maximum likelihood estimates of the population survivor function. The direct method, presented in the last column of table 10.1, can be used only in those intervals that precede the first instance of censoring. Although this limitation renders the method impractical for everyday use, we begin with it for its pedagogic value. To understand this method, think about what it means to “survive” through the end of an interval. For this to happen, a teacher must still be teaching by the end of that year. The last column captures this idea by presenting the proportion of all teachers (that is, of all 3941 teachers) still teaching by the end of each year. We see that .8843 (3485/3941) of the entire sample teach (survive) more than one year, .7869 (3101/3941) teach more than two years, and .5189 (2045/3941) teach more than six years. More generally, we write:

Ŝ(tj) = (number in the sample who survive past time period j) / (total number in the sample)
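The direct method can be sketched in a few lines of Python (counts from Table 10.1; note that it is usable only through year 6, before the first censored case appears):

```python
# Direct estimate of the survivor function: the proportion of ALL 3941
# teachers still teaching at the end of each year. Valid only while no
# one has yet been censored (years 1-6 in this data set).
n_total = 3941
still_teaching = [3485, 3101, 2742, 2447, 2229, 2045]   # ends of years 1-6

survivor_direct = [round(s / n_total, 4) for s in still_teaching]
print(survivor_direct)
# [0.8843, 0.7869, 0.6958, 0.6209, 0.5656, 0.5189]
```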

In year 7 and beyond, we can no longer compute these proportions because we do not know the event times of the censored teachers. We therefore have no way of knowing how many people did not experience the target event by the end of each of these later time periods.

The alternative method, which can be used regardless of censoring, proceeds indirectly, capitalizing on the information about event occurrence contained in the estimated hazard function. The idea is that, for each interval, the estimated hazard probability tells us not only about the probability of event occurrence but also about the probability of nonoccurrence, which in turn tells us about survival. Let us first review the logic in the early years, before censoring takes its toll. The year 1 estimated hazard probability of .1157 tells us that (1 – .1157) or .8843 of the original sample survives through the end of the first year. Similarly, the year 2 estimated hazard probability of .1102 tells us that (1 – .1102) or .8898 of those special educators who enter year 2 survive through the end of that year. But because only .8843 of the original sample actually enters their second year, only .8898 of the .8843, or .7869 of the original sample, survives through the end of year 2. We therefore estimate the survival probability for year 2 to be (.8898) (.8843) = .7869, a value identical to that obtained from the direct method.

We can use the indirect method to estimate values of the survivor function in any year, even in the presence of censoring. The estimated survival probability for year j is simply the estimated survival probability for the previous year multiplied by one minus the estimated hazard probability for year j:

(10.4)  Ŝ(tj) = Ŝ(tj−1)[1 − ĥ(tj)]

For example, we estimate that .5189 of all teachers survive through the sixth year of teaching. Because the estimated hazard probability for year 7 is .0601, we estimate that .9399 of those in the seventh-year risk set will not leave teaching that year. An estimate of the survival probability at the end of year 7 is thus (.5189)(.9399) = .4877. We have used this formula to estimate the survival probabilities for years 7 through 12 in table 10.1 (shown in italics). Plotted values appear in the lower panel of figure 10.1.

The estimated survivor function provides maximum likelihood estimates of the probability that an individual randomly selected from the population will “survive”—not experience the event—through each successive time period. In figure 10.1, notice that unlike the hazard function, which is presented only in the first time interval and beyond, the survivor function takes on the value 1.0 for interval 0—the origin of the time axis. As events occur, the estimated survivor function drops, here to .6958 by year 3, to .5656 by year 5, to .4877 by year 7, and to .4446 by year 9. Because many teachers stay for more than 12 years, the estimated survivor function does not reach zero, ending here at .4123. An estimated 41% of all special educators teach for more than 12 years; by subtraction, an estimated 59% leave in 12 years or less.

Notice that our estimate of the percentage of teachers still teaching after 12 years (41%) differs from the percentage still teaching at the end of data collection, 44% (1734/3941). Although small in this data set, this differential can be large, and it speaks volumes about what happens during analysis. Until the first censored event time, we can compute the percentage of the sample who survive directly, so the two percentages are identical. Once censoring occurs, we can no longer estimate the survivor function directly; we must estimate it indirectly, based upon those individuals who remain in the risk set. The beauty of survival analysis is that, under the assumption of independent censoring, we can use the risk set to estimate what would have happened to the entire remaining population were there no censoring. For example, although we know about year 12 event occurrence for only those 391 special educators in the first entry cohort who taught for 12 years, we can use these data to estimate what would have happened to teachers in the later cohorts were they to teach for 12 consecutive years. It is through this extrapolation that we generalize our sample results (including data on the censored individuals) back to the entire population.

Before leaving this discussion of the survivor function, let us resolve one small detail about its estimation. Use of equation 10.4 in any time interval requires an estimate of the function in the previous interval. Is there any way to eliminate this dependence, allowing the survivor function to be estimated solely on the basis of the hazard function? To see that the answer is yes, use equation 10.4 to write an expression for the sample survivor function in year (j − 1):

Ŝ(tj−1) = Ŝ(tj−2)[1 − ĥ(tj−1)]

By repeatedly substituting this type of formula into equation 10.4 until time 0, when S(t0) = 1.0, we find:

(10.5)  Ŝ(tj) = [1 − ĥ(tj)][1 − ĥ(tj−1)] … [1 − ĥ(t1)]

In other words, each year’s estimated survival probability is the successive product of the complement of the estimated hazard probabilities across this and all previous years. For example, an estimate of the year 7 survival probability is (1 – .0601) (1 – .0825) (1 – .0891) (1 – .1076) (1 – .1158) (1 – .1102) (1 – .1157) = .4877. Equation 10.5 allows us to estimate the survivor function directly from the estimated hazard function. Unfortunately, censoring prevents us from working in the opposite direction, estimating the hazard function directly from the estimated survivor function.
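Equation 10.5 is a cumulative product, which makes the indirect method a short computation. The sketch below (plain Python, counts from Table 10.1) reproduces the entire survivor column, including the censored years:

```python
at_risk = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]
events  = [456,  384,  359,  295,  218,  184,  123,  79,   53,   35,  16,  5]
hazard  = [e / n for e, n in zip(events, at_risk)]

# Equation 10.4 applied recursively, i.e., the cumulative product of
# (1 - hazard) in equation 10.5. S(t0) = 1 because everyone "survives"
# the 0th interval.
survival, s = [], 1.0
for h in hazard:
    s *= 1.0 - h
    survival.append(s)

print([round(v, 4) for v in survival])
# [0.8843, 0.7869, 0.6958, 0.6209, 0.5656, 0.5189,
#  0.4877, 0.4642, 0.4446, 0.4282, 0.4177, 0.4123]
```

Through year 6 the product telescopes to the direct estimates; from year 7 on it extrapolates past the censoring, exactly as described above.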

## 10.2.3 Median Lifetime

Having characterized the distribution of event times using the hazard and survivor functions, we often want to identify the distribution’s center. Were there no censoring, all event times would be known, and we could compute a sample mean. But because of censoring, another estimate of central tendency is preferred: the median lifetime.

The estimated median lifetime identifies that value of T for which the value of the estimated survivor function is .5. It is the point in time by which we estimate that half of the sample has experienced the target event and half has not. Examining the estimated survivor function presented in the last column of table 10.1, we know that the estimated median lifetime falls somewhere between year 6 (when an estimated .5189 of the teachers are still working at the end of the year) and year 7 (when this proportion drops below .5, to .4877). Because our metric for time is discretized at the year level, one way to report this conclusion would be to write that the average special educator leaves after completing six, but not a full seven, years of teaching.

Another way to estimate the median lifetime is to use interpolation. Interpolation is most useful when comparing subsamples, especially if their medians fall in the same time interval. Even when this happens, the subsamples rarely have identical estimated survivor functions, so a median computed without interpolation is too coarse to characterize the groups’ differing survival experiences.

Following Miller (1981), we linearly interpolate between the two values of S(tj) that bracket .5. Let m represent the time interval when the sample survivor function is just above .5 (here, year 6), let Ŝ(tm) represent the value of the sample survivor function in that interval, and let Ŝ(tm+1) represent its value for the following interval (when it must be just below .5). We then estimate the median lifetime as:

(10.6)  Estimated median lifetime = m + [(Ŝ(tm) − .5) / (Ŝ(tm) − Ŝ(tm+1))](tm+1 − tm)

For the special educators, we compute the estimated median length of stay to be:

6 + [(.5189 − .5) / (.5189 − .4877)](7 − 6) = 6 + (.0189/.0312) = 6.6 years.

In the lower panel of figure 10.1, we graphically illustrate this interpolation by drawing a line parallel to the time axis where the estimated survivor function equals .50 and by then dropping a perpendicular from the estimated survivor function to the time axis to identify the corresponding value of T.
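The interpolation in equation 10.6 is a one-line computation. A sketch (the bracketing survivor estimates .5189 and .4877 come from Table 10.1):

```python
# Linear interpolation for the estimated median lifetime (equation 10.6).
m, m_next = 6, 7              # interval where survival is just above .5, and the next
s_m, s_next = 0.5189, 0.4877  # bracketing survivor-function estimates

median = m + (s_m - 0.5) / (s_m - s_next) * (m_next - m)
print(round(median, 1))   # 6.6
```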

Unlike the biased estimates of mean duration presented in section 9.3.3, the estimated median lifetime of 6.6 years correctly answers the question “How long does the average teacher teach?” We now see that the answer is neither the mean of 3.7 years we calculated by setting aside all censored observations nor the mean of 7.5 years we calculated by treating all censored event times as known event times. Notice, too, that although we derive this estimate through a circuitous route—by first estimating a hazard function, then a survivor function, and finally a median lifetime—our answer is expressed in a comprehensible metric. Perhaps because physicians now routinely provide estimated median lifetimes to patients following diagnosis of an illness or initiation of treatment, these summaries communicate results effectively in other fields as well. Although it is wise to remind your audience that this estimate is just a median—half the teachers stay for less than 6.6 years and the other half stay longer (or, in some studies, may never experience the target event)—the statistic has much intuitive appeal.

What should you do if the estimated survivor function does not reach .5? This tells you that less than half of the population is expected to experience the target event by the last time period in the life table. This dilemma arises in studies of short duration or of rare (or less common) events, such as the onset of mental illness or illicit drug use. Although we can estimate a different percentile of the survivor function (say, the 75th percentile), researchers more often present cumulative survival rates: values of the estimated survivor function after pre-specified lengths of time. In medical research, one-year, three-year, and five-year survival rates are common. In your own study, choose benchmarks suitable for the metric of time and the rate at which events occur. When studying teachers’ (p. 339 ) careers, for example, five- and ten-year survival rates (here, 57% and 43%, respectively) are natural benchmarks.
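Cumulative survival rates like those just cited can be read straight off the estimated survivor function. A hedged sketch (values transcribed from table 10.2 and rounded; variable names our own):

```python
# Reading cumulative survival rates off the estimated survivor function
# (special educator estimates from table 10.2, rounded; names are ours).

s_hat = {1: 0.8843, 2: 0.7869, 3: 0.6958, 4: 0.6209, 5: 0.5656,
         6: 0.5189, 7: 0.4877, 8: 0.4642, 9: 0.4446, 10: 0.4282,
         11: 0.4177, 12: 0.4123}

five_year = round(100 * s_hat[5])    # five-year survival rate, in percent
ten_year = round(100 * s_hat[10])    # ten-year survival rate, in percent
print(five_year, ten_year)           # 57 43, the rates cited above
```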

# 10.3 Developing Intuition About Hazard Functions, Survivor Functions, and Median Lifetimes

Developing intuition about these sample statistics requires exposure to estimates computed from a wide range of studies. To jump-start this process, we review results from four studies that differ across three salient dimensions—the type of event investigated, the metric used to record discrete time, and most important, the underlying profile of risk—and discuss how we would examine, and describe, the estimated hazard functions, survivor functions, and median lifetimes.

The four panels of figure 10.2 provide the basis for our work.

• Panel A presents data from Hall, Havassy, and Wasserman’s (1990) study of relapse to cocaine use among 104 former addicts released from an in-patient treatment program. After 12 weekly follow-ups, 62 people had relapsed; 42 (40.4%) were censored (remained drug-free).

• Panel B presents data from Capaldi, Crosby, and Stoolmiller’s (1996) study of the grade in which a sample of at-risk adolescent males had heterosexual intercourse for the first time. Among 180 boys tracked from seventh grade, 54 (30.0%) were still virgins (were censored) when data collection ended in 12th grade.

• Panel C describes the age at first suicide ideation for the 391 undergraduates in Bolger and colleagues’ (1989) study introduced in section 9.1.3. Recall that 275 undergraduates reported having previously thought about suicide; 116 (29.7%) were censored (had not yet had a suicidal thought).

• Panel D describes how long female members of the U.S. House of Representatives remain in office. This data set tracks the careers of all 168 women who were elected between 1919 and 1996, for up to eight terms (or until 1998); the careers of 63 (37.5%) were censored.

## 10.3.1 Identifying Periods of High and Low Risk Using Hazard Functions

Hazard functions are the most sensitive tool for describing patterns of event occurrence. Unlike survivor functions, which cumulate information across time, hazard functions display the unique risk associated with each time period. By examining variation over time in the magnitude of the hazard function, we identify when events are particularly likely, or unlikely, to occur.

(p. 340 ) Figure 10.2. Estimated hazard functions, survivor functions, and median lifetimes from four studies. Panel A: Time to cocaine relapse after release from an in-patient treatment program. Panel B: Grade at first heterosexual intercourse for males. Panel C: Age at first suicide ideation. Panel D: Duration of uninterrupted congressional careers for female representatives. (p. 341 )

We describe variation in the magnitude of the hazard function by locating its distinctive peaks and troughs. Peaks pinpoint periods of elevated risk; troughs pinpoint periods of low risk. When identifying peaks and troughs, be sure to look beyond minor period-to-period differences that may reflect nothing more than sampling variation (discussed in section 10.4). The goal is to learn about “the big picture”—the general profile of risk over time. Imagine stepping back from the estimated hazard functions in figure 10.2, glossing over the small inevitable zigzags between adjacent time periods, focusing instead on the function’s overall shape. From this vantage point, the pronounced peaks and troughs appear at different locations along the time axis. The top two hazard functions have a single distinctive peak and a single distinctive trough—they are monotonic. The bottom two hazard functions have multiple distinctive peaks or troughs—they are nonmonotonic. Although the length of data collection affects our ability to identify peaks and troughs (in ways we will soon describe), let us examine each of these functions in greater detail, describing broadly how risk rises and falls over time.

Aside from sampling variation, the hazard function in Panel A peaks immediately after the “beginning of time” and declines rather steadily thereafter. Recently treated former cocaine addicts are most likely to relapse shortly after leaving treatment. Over time, as they acclimate to life outside the hospital, the risk of relapse declines. Monotonically decreasing hazard functions like these—also shown in figure 10.1 for the teacher turnover data—appear throughout the social and behavioral sciences. To some extent, this preponderance reflects social scientists’ fascination with recurrence and relapse. Whether the target event is substance abuse (e.g., Hasin et al., 1996), mental illness (e.g., Mojtabai et al., 1997), child abuse (e.g., Fryer & Miyoshi, 1994), or incarceration (e.g., Harris & Koepsell, 1996; Brown, 1996), risk of recurrence is highest immediately after treatment, identification, or release. Monotonically decreasing hazard functions arise when studying other events as well. Two examples are Hurlburt, Wood, and Hough’s (1996) study of whether and when homeless individuals find housing and Diekmann, Jungbauer-Gans, Krassnig, and Lorenz’s (1996) study of whether and when drivers respond aggressively to being blocked by a double-parked car.

The hazard function in Panel B is also monotonic but in the opposite direction: it begins low and increases over time. Few boys initiated intercourse in either seventh or eighth grade. Beginning in ninth grade, the (p. 342 ) risk of first intercourse increases annually among those who remain virgins. In ninth grade, for example, an estimated 15.0% of the boys who had not yet had sex have sex for the first time; by 12th grade, 31.7% of the remaining virgins (admittedly only 45.1% of the original sample) do likewise. Monotonically increasing hazard functions are common when studying events that are ultimately inevitable (or nearly universal). At the “beginning of time,” few people experience the event, but as time progresses, the decreasing pool of individuals who remain at risk succumbs. Kiefer (1988) found a similar pattern when characterizing the time it takes to settle a labor dispute, as did Campbell, Mutran, and Parker (1987), who studied how long it takes workers to retire.

Some hazard functions display multiple peaks or troughs. Panel C suggests that the risk of suicide ideation is low during childhood, peaks during adolescence, and then declines to near (but not quite) early childhood levels in late adolescence. (Do not pay much attention to the apparent increase in the last time period, for it is little more than sampling variation.) Diekmann and Mitter (1983) found a similar type of hazard function when they asked a sample of young adults to report retrospectively whether, and if so when, they had ever shoplifted. They found that the age at first shoplift varied widely, from age 4 to 16, with a peak during early adolescence—ages 12 to 14. In a different context, Gamse and Conger (1997) found a similarly shaped hazard function when following the academic careers of recipients of a postdoctoral research fellowship. The hazard function describing time to tenure was low in the early years of the career, peaked in years 6 through 8, and declined thereafter.

The U-shaped pattern in Panel D is nicknamed the “bathtub” hazard function. Risk is high at two different moments: immediately after the “beginning of time” and again, at the end of time. In studies of human lifetimes, especially in developing countries, the high initial risk reflects the effects of infant mortality while the later high risk reflects the effects of old age. Here, we find a similar pattern. Congresswomen are at greatest risk of leaving office at two points in their careers: immediately after their first election and then after having served for a long period of time (seven or eight terms). In the middle period—between the second and sixth terms—the effects of incumbency reign, with relatively few continuing representatives stepping down or losing an election.

Nonmonotonic hazard functions, like those in Panels C and D, generally arise in studies of long duration. This design dependency arises for a simple reason: in brief studies, the particular time associated with the middle peak (or trough) appears on the far right of the time axis—erroneously suggesting a monotonically increasing (or decreasing) hazard. (p. 343 ) The problem is that when data collection is brief (and brief can be years!), we have no way of knowing what happens in time periods that occur after the end of data collection. To find a reversal, indicated by the multiple peaks (or troughs), data collection must be of a sufficient length.

This illustrates the need for a caveat whenever describing hazard functions: Be sure that the time indicated at the end of the time axis has substantive meaning as an “end of time.” If not, use extreme caution when identifying (and describing) the varying pattern of risk. In Panel B, for example, had Capaldi and colleagues (1996) followed the 54 young men who had not had sex by the end of 12th grade, they might have found that the annual risk of initiation peaks even later, after high school. Or had we followed the congresswomen in Panel D for only four terms, we would have concluded that risk monotonically decreases over time. Although this conclusion is accurate for the first four terms, we would not want to erroneously generalize this short-term finding. Statements about periods of elevated (or diminished) risk must always be tied to statements about the range of time actually studied. Failure to do so is tantamount to extrapolating (through silence) beyond the range of the data.

What happens if the hazard function displays no peaks or troughs? When hazard is flat, risk is unrelated to time. Under these circumstances, event occurrence is independent of duration in the initial state, implying that events occur (seemingly) at random. Because of age, period, and cohort effects—all of which suggest duration dependence—flat hazard functions are rare in the social and behavioral sciences. Two interesting examples, however, are whether and when couples divorce following the birth of a child (Fergusson, Horwood, & Shannon, 1984) and whether and when elementary school children shift their attention away from their teacher (Felmlee & Eder, 1984).

## 10.3.2 Survivor Functions as a Context for Evaluating the Magnitude of Hazard

As is apparent in figure 10.2, all survivor functions share a common shape: each is a monotonically nonincreasing function of time. At the beginning of time, each takes on the value 1.0. Over time, as events occur, each drops toward 0. Because of censoring, and because some individuals may never experience certain events no matter how long data collection lasts, few estimated survivor functions fall to zero. The value of the survivor function at the “end of time” estimates the proportion of the population that will survive past this last observed period.

(p. 344 ) Studying the concurrent features of the estimated hazard and survivor functions in figure 10.2 reveals a great deal about their interrelationship. Examining the four panels we see that:

• When hazard is high, the survivor function drops rapidly—as in the early time periods in Panels A and D.

• When hazard is low, the survivor function drops slowly—as in the early time periods in Panels B and C and the later time periods of A and C.

• When hazard is zero, the survivor function remains unchanged. Although not shown here, if h(tj) = 0, S(tj) will be identical to S(tj−1).

In general, large values of hazard produce great changes in survivorship; small values produce little change.
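These observations follow from the running-product relationship between hazard and survival described in section 10.2 (the survivor probability is the product of (1 − hazard) across this and all previous periods). A minimal sketch, with hazard values invented purely to illustrate each bullet:

```python
# The survivor function as a running product of (1 - hazard): high hazard
# produces a steep drop, zero hazard leaves it flat, low hazard a small drop.
# The hazard values below are invented purely for illustration.

def survivor_from_hazard(hazards):
    s, out = 1.0, []
    for h in hazards:
        s *= (1.0 - h)   # survivors are those who did not succumb this period
        out.append(s)
    return out

s = survivor_from_hazard([0.4, 0.0, 0.1])
print([round(v, 2) for v in s])  # big drop, then no change, then a small drop
```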

If the survivor function is simply a cumulative reflection of the magnitude of the peaks and troughs in the hazard function, of what additional value is it? One advantage is its intuitive appeal, which renders it useful when communicating findings. More important, though, the survivor function provides a context for evaluating the period-by-period risks reflected in the hazard function. Because the survivor function cumulates these risks to estimate the fraction of the population remaining in each successive time period, its value indicates the proportion of people exposed to each period’s hazard. If the estimated survival probability is high when hazard is high, many people are affected; if it is low, even if hazard is high, there are few people left to experience the elevated risk.

The practice of using the survivor function to provide a context for evaluating the magnitude of hazard is similar to epidemiologists’ practice of studying both prevalence and incidence. Incidence measures the number of new events occurring during a time period (expressed as a proportion of the number of individuals at risk), whereas prevalence cumulates these risks to identify the total number of events that have occurred by a given point in time (also as a proportion; see, e.g., Kleinbaum, Kupper, & Morgenstern, 1982; Lilienfeld & Stolley 1994). Stated this way, we can see that incidence and prevalence correspond directly to hazard and survival: hazard represents incidence, survival represents cumulative prevalence. Epidemiologists rely on incidence when identifying the risk factors associated with disease occurrence because prevalence confounds incidence with duration—conditions of longer duration may be more prevalent even if they have equal or lower incidence rates. But epidemiologists also recognize that prevalence, like survival, has an advantage: it assesses the extent of a problem at a particular point in time. Estimates of prevalence thereby provide a context for evaluating (p. 345 ) the magnitude of incidence. In survival analysis, estimates of the survivor function provide a similar context for evaluating the magnitude of hazard.

The consequence of this argument is that the survivor function indicates whether the elevated risks in periods of high hazard are likely to affect large numbers, or small numbers, of people. At the extreme, if risk is high among a very small group, the times of greatest risk may not be the times when most events occur. If hazard is increasing while the risk set is decreasing, a high hazard may have little effect. In both the age at first intercourse study and the congressional turnover study, the last periods are those with the highest hazards. But the risk sets in these periods are smaller than the risk sets in the earlier periods, so the elevated hazard may indicate that fewer total events take place in these periods than did in the earlier periods with lower hazards. For example, three times as many congresswomen (n = 19) leave office in their second term, when hazard is .17, as leave in their eighth term (n = 6), when hazard is .27.

This irregular correspondence between hazard and the number of events does not indicate a flaw in the concept of hazard; rather, it underscores the need to examine the survivor function as well. Hazard is inherently conditional: it describes the risk of event occurrence only among those still at risk. Be sure to reassert this conditionality periodically so that you do not mistakenly suggest that more events occur in periods of high hazard when, in fact, fewer actually do.

## 10.3.3 Strengths and Limitations of Estimated Median Lifetimes

Unlike hazard and survivor functions, which describe the distribution of event times, the median lifetime identifies the distribution’s location or “center.” Examining the estimated median lifetimes displayed in figure 10.2, for example, we see that the average former addict relapses 7.6 weeks after treatment (Panel A), the average at-risk adolescent male initially has heterosexual intercourse midway through the second semester of tenth grade (Panel B), the average teenager has had a suicidal thought by age 14.8 (Panel C), and the average U.S. congresswoman remains in office for 3.5 terms (Panel D).

When examining a median lifetime, we find it helpful to remember three important limitations on its interpretation. First, it identifies only an “average” event time; it tells us little about the distribution of event times and is relatively insensitive to extreme values. Second, the median lifetime is not necessarily a moment when the target event is especially (p. 346 ) likely to occur. For example, although the average congresswoman remains in office for just under four terms, hazard is actually low during the fourth term. Third, the median lifetime reveals little about the distribution of risk over time; identical median lifetimes can result from dramatically different survivor and hazard functions.

We illustrate these insights in figure 10.3, which presents estimated hazard functions, survivor functions, and median lifetimes for four hypothetical data sets. We constructed these data sets purposefully, with the goal of highlighting the difficulties inherent in the interpretation of median lifetimes. In each data set, comprising ten time intervals, the estimated survival probability in period 5 is exactly .50 so that the estimated median lifetime is precisely 5.0.

Notice the dramatic differences in the accompanying hazard and survivor functions. Although all four data sets have the same estimated median lifetime, few researchers examining these panels would conclude that the studies had anything in common. In Panel A, hazard begins low, rises steadily until its peak in period 5, and then declines steadily until period 10. This type of situation, in which the estimated median lifetime of 5.0 coincides with the period of greatest risk, is what most people initially believe the estimated median lifetime suggests.

But before concluding that an estimated median lifetime tells us anything about the shape of the hazard or survivor function, examine Panel B. The first half of the hazard function in this panel is identical to that of Panel A—it begins low and climbs steadily to its peak in period 5. After that point in time, however, hazard remains high, at the same value as in time period 5. Yet the median lifetime remains unchanged because its computation depends solely on the early values of the estimated survivor function (before it reaches .50). Its later values have no effect whatsoever on the calculation.

The remaining two data sets present even more extreme relationships between profiles of risk and median lifetimes. In Panel C, hazard begins high and declines steadily over time. Here, the estimated median lifetime of 5.0 corresponds to a low-risk period, and the distribution of risk over the entire time axis looks entirely different from that presented in Panels A and B (although the estimated medians of the survivor functions are identical). A similar conclusion comes from examining Panel D. Here, hazard is constant across time, and yet the estimated median lifetime falls in exactly the same place on the time axis.
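Panel D's construction can be verified numerically: with a constant per-period hazard h chosen so that (1 − h)^5 = .5, the survivor function reaches exactly .50 at the end of period 5, giving a median lifetime of 5.0 despite a perfectly flat risk profile. A sketch with our own numbers, not the book's hypothetical data:

```python
# Numerical check that a flat hazard can still yield a median lifetime of
# exactly 5.0, as in Panel D of figure 10.3. Our own construction.

h = 1 - 0.5 ** (1 / 5)       # constant per-period hazard with (1 - h)^5 = .5

s, survivor = 1.0, []
for _ in range(10):          # ten time intervals, as in the hypothetical data
    s *= (1 - h)
    survivor.append(s)

print(round(h, 4), round(survivor[4], 4))  # hazard ~.1294; S(5) is exactly .5
```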

What conclusions should you draw from this exercise?

• Never draw inferences about the hazard or survivor functions on the basis of an estimated median lifetime. All this statistic does is identify one (p. 347 )

Figure 10.3. Learning to interpret median lifetimes. Results from four hypothetical data sets constructed so that each has the same estimated median lifetime (5.0), but dramatically different estimated hazard and survivor functions.

(p. 348 ) particular point—albeit a meaningful one—along the estimated survivor function’s path.

• Never assume that the time corresponding to the estimated median lifetime is one of particularly high risk. The estimated median lifetime tells us nothing about hazard in that, or any other, time period.

• Remember that a median is just a median: nothing more than one estimate of the location of a distribution. If you want to know more about the distribution of event occurrence, the hazard function and, to a lesser extent, the survivor function are more useful.

# 10.4 Quantifying the Effects of Sampling Variation

When examining estimated hazard functions, we suggested that small period-to-period fluctuations were likely due to sampling variation. We now quantify the degree of sampling variation by computing the standard errors of the estimated hazard probabilities (section 10.4.1) and survival probabilities (section 10.4.2).

## 10.4.1 The Standard Error of the Estimated Hazard Probabilities

Consider the population value of hazard in the jth time period, h(tj). Using equation 10.2, we can estimate this parameter as the fraction of time period j’s risk set (nj) who experience the target event in that period. Because this estimate is simply a sample proportion, its standard error can be estimated using the usual formula for the standard error of a proportion:

se(h(tj)) = √( h(tj)[1 − h(tj)] / nj ) (10.7)
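As a quick numerical check of equation 10.7 (our own code, not the authors'), the year 1 standard error in table 10.2 can be reproduced from the estimated hazard and the size of the risk set:

```python
# Checking equation 10.7 against year 1 of table 10.2:
# h = 0.1157067, risk set n = 3941.

import math

def se_hazard(h, n):
    """Standard error of a sample proportion: sqrt(h(1 - h)/n)."""
    return math.sqrt(h * (1.0 - h) / n)

print(round(se_hazard(0.1157067, 3941), 6))  # ~0.005095, matching the table
```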

The left side of table 10.2 presents estimated hazard probabilities and their accompanying standard errors for the special educator data. We present these estimates to seven decimal places so that you can confirm our calculations with precision. Notice that all the standard errors are very small, never exceeding .007, indicating that each hazard probability is estimated very precisely. Precision is a direct consequence of the large number of teachers being tracked and the relatively low annual exit rates. The hazard probability for year 1 is estimated using a risk set of 3941, and even the much-diminished risk set for year 12 has 391 members. If your (p. 349 )

Table 10.2: Calculating standard errors for estimated hazard and survival probabilities

| Year | nj | Estimated hazard probability | Standard error (hazard) | Estimated survivor probability | Term under the square root sign | Standard error (survivor) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 3941 | 0.1157067 | 0.0050953 | 0.8842933 | 0.0000332 | 0.0050953 |
| 2 | 3485 | 0.1101865 | 0.0053041 | 0.7868561 | 0.0000687 | 0.0065235 |
| 3 | 3101 | 0.1157691 | 0.0057455 | 0.6957625 | 0.0001109 | 0.0073288 |
| 4 | 2742 | 0.1075857 | 0.0059173 | 0.6209084 | 0.0001549 | 0.0077282 |
| 5 | 2447 | 0.0890886 | 0.0057588 | 0.5655925 | 0.0001948 | 0.0078958 |
| 6 | 2229 | 0.0825482 | 0.0058289 | 0.5189038 | 0.0002352 | 0.0079589 |
| 7 | 2045 | 0.0601467 | 0.0052576 | 0.4876935 | 0.0002665 | 0.0079622 |
| 8 | 1642 | 0.0481120 | 0.0052812 | 0.4642295 | 0.0002973 | 0.0080048 |
| 9 | 1256 | 0.0421974 | 0.0056726 | 0.4446402 | 0.0003324 | 0.0081067 |
| 10 | 948 | 0.0369198 | 0.0061243 | 0.4282242 | 0.0003728 | 0.0082686 |
| 11 | 648 | 0.0246913 | 0.0060961 | 0.4176508 | 0.0004119 | 0.0084764 |
| 12 | 391 | 0.0127877 | 0.0056821 | 0.4123100 | 0.0004450 | 0.0086981 |

initial sample size is smaller and the rate of decline in the risk set steeper, the standard errors for each successive time interval will be larger.

How can you develop your intuition about the magnitude of these standard errors? Although a precise answer involves simultaneous examination of the numerator and denominator of equation 10.7, we can examine each component separately to develop two general ideas:

• The closer hazard is to .50, the less precise the estimate; the closer hazard is to 0 (or 1), the more precise. The numerator in equation 10.7 is at its maximum when the estimated value of hazard is .50, and it declines as hazard moves toward either 0 or 1. Because the estimated value of hazard is usually below .5 (less than half of a risk set experiences the event in any period), we can simplify this statement: larger values of hazard are usually estimated less precisely and smaller values more precisely (for a risk set of the same size).

• The larger the risk set, the more precise the estimate of hazard; the smaller the risk set, the less precise. Because the size of the risk set appears in the denominator of equation 10.7, the estimated standard error will be larger in those time periods when fewer people are at risk. As the risk set declines over time, later estimates of hazard will tend to be less precise than earlier estimates.

(p. 350 ) Why, then, do we not observe a more dramatic increase in the standard error for hazard in the special educator data in table 10.2? This stability results from two phenomena. First, the estimated value of hazard declines over time, so the general increase in standard error that accompanies a decrease in the size of the risk set is counterbalanced by the decrease in the standard error that accompanies hazard’s decline. Second, although the size of the risk set declines over time, even the last time period in this data set contains 391 individuals. Were the later risk sets smaller, we would observe a more noticeable increase in the standard error of hazard.

## 10.4.2 Standard Error of the Estimated Survival Probabilities

Estimating the standard error of a survival probability is a more difficult task than estimating the standard error of its associated hazard probability. This is because unlike hazard, which is estimated as the fraction of the risk set that experience the target event in any given period, the survival probability is estimated as a product of (1-hazard) for this and all previous time periods. Estimating the standard error of an estimate that is itself the product of several estimates is a difficult statistical task. Indeed, it is so difficult that statisticians rarely recommend that you estimate the standard error of the survival probabilities directly; rather, you can do almost as well by relying on what is known as Greenwood’s approximation.

In an early classic paper on life tables, Greenwood (1926) demonstrated that the standard error of the survival probability in time period j can be approximated as:

se(S(tj)) = S(tj) √( Σ_{i=1}^{j} h(ti) / [ni(1 − h(ti))] ) (10.8)

The summation under the square root involves all time periods up to and including the time period of interest. The standard error of the estimated survivor function in time period 1 involves only the first term; the standard error in time period 2 involves only the first two terms. As the estimated survivor function in time period j depends upon the estimated hazard function in that time period as well as estimates from all preceding time periods (as shown in equation 10.5), it should come as no surprise that its standard error also involves the estimated values of hazard in all preceding time periods.
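Greenwood's approximation is straightforward to compute alongside the survivor function itself. The sketch below (our own code, with our own names) reproduces the first two rows of table 10.2 from the estimated hazards and risk sets:

```python
# Sketch of Greenwood's approximation (equation 10.8), checked against the
# first two rows of table 10.2 for the special educator data.

import math

def survivor_with_greenwood_se(hazards, risk_sets):
    """Return (survivor probability, Greenwood standard error) per period."""
    s, running_sum, out = 1.0, 0.0, []
    for h, n in zip(hazards, risk_sets):
        s *= (1.0 - h)                       # product of (1 - hazard) terms
        running_sum += h / (n * (1.0 - h))   # term under the square root sign
        out.append((s, s * math.sqrt(running_sum)))
    return out

rows = survivor_with_greenwood_se([0.1157067, 0.1101865], [3941, 3485])
print(rows[1])  # year 2: survivor ~.7868561, standard error ~.0065235
```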

The standard errors of the estimated survival probabilities for the special educator data are shown in table 10.2. Each is very small, never (p. 351 ) even reaching 0.01. This suggests that our individual estimates of the survival probabilities, just like our individual estimates of the hazard probabilities, are quite precise. But as an approximation, Greenwood’s formula is accurate only asymptotically. Harris and Albert (1991) suggest that these standard errors should not be trusted for any time period in which the size of the risk set drops below 20.

# 10.5 A Simple and Useful Strategy for Constructing the Life Table

Having demonstrated the value of the life table, we now address the practical question: How can you construct a life table for your data set? For preliminary analyses, it is easy to use the prepackaged routines available in the major statistical packages. If you choose this approach, be sure to check whether your package allows you to: (1) select the partition of time; and (2) ignore any “actuarial” corrections invoked due to continuous-time assumptions (that do not hold in discrete time). When event times have been measured using a discrete-time scale, actuarial corrections (discussed in chapter 13) are inappropriate. Although most packages clearly document the algorithm being used, we suggest that you double-check by comparing results with one or two estimates computed by hand.

Despite the simplicity of preprogrammed algorithms, we prefer an alternative approach for life table construction. This approach requires construction of a person-period data set, much like the person-period data set used for growth modeling. Once you create the person-period data set, you can compute descriptive statistics using any standard cross-tabulation routine. The primary advantage of this approach rests not with its use for the descriptive analyses outlined in this chapter but in its use for model building. As will become apparent in chapters 11 and 12, the person-period data set is an integral tool for the systematic fitting of discrete-time hazard models. And as will become apparent in chapters 14 and 15, it also forms the conceptual foundation for fitting certain continuous-time hazard models as well. An associated website presents code for creating person-period data sets in several major statistical packages.

## 10.5.1 The Person-Period Data Set

Like the person-period data set used for growth modeling, the person-period data set used for discrete-time survival analysis has multiple lines (p. 352 ) of data for each person under study. An important difference, however, is that the person-period data set used for growth modeling has a separate record for each time period when an individual is observed, whereas the person-period data set for discrete-time survival analysis has a separate record for each time period when an individual is at risk.

Researchers often store event history data in a “person-oriented” file, in which each individual’s data appears on a single record. Each record contains all the data ever collected for that person. As you collect additional longitudinal data, you add variables to the file. If you think of this file as a spreadsheet, with individuals indexed in rows and variables in columns, over time the file grows in width but never in length. A person-period data set, in contrast, spreads the data for each individual across multiple records, each record describing a specific time period. With each additional wave of data collection, the person-period data set grows in length. A person-period data set grows in width only if new variables, not assessed on a previous occasion, are added to the protocol.

Figure 10.4 illustrates the conversion from a person-oriented data set (in the left panel) to a person-period data set (in the right panel) using three individuals from the special educator study. The first two teachers have known event times (they stayed 3 and 12 years, respectively); the third was censored at 12 years. The person-oriented data set describes these teachers’ event histories using two variables:

• Event time (here, T). For the first two teachers with known event times, Ti is set to that time (3 and 12, respectively). For the third teacher, who was still teaching when data collection ended, Ti is also set to 12 (the last time period when the event could have occurred).

• Censoring indicator (here, CENSOR). For teachers with known event times (subjects 20 and 126), CENSOR = 0; for teachers with censored event times (subject 129), CENSOR = 1.

As there are 3941 teachers in this sample, the data set has 3941 records.

(p. 353 ) Figure 10.4. Conversion of a person-level data set into a person-period data set for three special educators from the teacher turnover study.

In the person-period data set, each individual has a separate record for each discrete-time period when he or she was at risk of event occurrence. Because individuals first become at risk during year 1, j = 1 is the first time period recorded in the person-period data set for each teacher. (In other data sets, it may be meaningful to count time from another origin; if so, simply set the values of this variable accordingly.) Teacher 20, who taught for three years, has three records, one each for the first, second, and third years of teaching. Teacher 126, who taught for 12 years, has 12 records, one per year, as does teacher 129 (who was still teaching when data collection ended, thereby remaining at risk of event occurrence in all 12 years). The values of the variables in the person-period data set reflect the status of person i on that variable in the jth period. Referring to the right panel of figure 10.4, the simplest person-period data set includes:
• A period variable, here PERIOD, which specifies the time period j that the record describes. For teacher 20, this variable takes on the values 1, 2, and 3 to indicate that her three records describe her status in those three years. For the other two teachers, PERIOD takes on the values 1 through 12, indicating the years represented in their twelve records.

• An event indicator, here EVENT, which indicates whether the event occurred in that time period (0 = no event, 1 = event). For each (p. 354 ) person, the event indicator must be 0 in every record except, possibly, the last. Noncensored individuals experience the event in their final period, so EVENT takes on the value 1 in that period (as in the third record for teacher 20 and the 12th record for teacher 126). Censored individuals (such as teacher 129) never experience the event, so EVENT remains 0 throughout.
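The conversion just described is purely mechanical, so it is easy to sketch in code. Below is a minimal Python illustration using the three special educators from figure 10.4; the function name `to_person_period` is our own, not part of the text or any package.

```python
# Sketch: convert a person-oriented record (ID, T, CENSOR) into
# person-period records (ID, PERIOD, EVENT). Variable names follow
# figure 10.4; the function name is illustrative only.

def to_person_period(person_id, t, censor):
    """Return one (ID, PERIOD, EVENT) row per period the person was at risk."""
    rows = []
    for period in range(1, t + 1):
        # EVENT is 1 only in the final period, and only if the case is noncensored.
        event = 1 if (period == t and censor == 0) else 0
        rows.append((person_id, period, event))
    return rows

# The three special educators shown in figure 10.4:
person_level = [(20, 3, 0), (126, 12, 0), (129, 12, 1)]

person_period = []
for pid, t, censor in person_level:
    person_period.extend(to_person_period(pid, t, censor))

print(len(person_period))   # 3 + 12 + 12 = 27 records
print(person_period[:3])    # teacher 20's three records: EVENT = 1 only in the last
```

Note that teacher 129, the censored case, contributes 12 records in which EVENT is 0 throughout, exactly as described above.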

Substantive predictors that might be associated with event occurrence (such as a teacher’s gender, salary, or classroom assignment) can easily be added to the person-period data set. We discuss this extension at length in section 11.3.

Person-period data sets have many more records than their corresponding person-oriented data sets because each individual contributes one record for every time period in which he or she is at risk of event occurrence. Since 3941 teachers were at risk in year 1, 3485 were at risk in year 2, 3101 were at risk in year 3, and so on through the 391 at risk in year 12, this person-period data set has 3941 + 3485 + … + 391 = 24,875 records. We can also compute the number of records for which EVENT = 1 by subtracting the number of censored cases (those who never receive the value 1) from the size of the original risk set (n). In this example, because 1734 of the original 3941 teachers have censored event times, we know that EVENT takes on the value 1 in only (3941 – 1734) = 2207 of the records and the value 0 in the remaining (24,875 – 2207) = 22,668.
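This bookkeeping can be verified directly from the risk-set sizes (the Total column of table 10.3). A quick Python check, with the counts transcribed from the text and table:

```python
# Risk-set sizes for years 1-12 (the Total column of table 10.3).
at_risk = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]

n_records = sum(at_risk)     # one record per person per period at risk
n_events = 3941 - 1734       # each noncensored teacher contributes one EVENT = 1
n_zeros = n_records - n_events

print(n_records, n_events, n_zeros)   # 24875 2207 22668
```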

## 10.5.2 Using the Person-Period Data Set to Construct the Life Table

All the life table’s essential elements can be computed by cross-tabulating PERIOD and EVENT in the person-period data set. Any statistical package can produce output like that displayed in table 10.3. To verify the accuracy of this approach numerically, compare these entries to the life table in table 10.1. Below, we explain why the approach works.

The cross-tabulation of PERIOD by EVENT in the person-period data set produces a J by 2 table. Each row j describes the event histories of those people in the risk set during the jth time period. The number of cases in the row (the Total column in table 10.3) identifies the size of that period’s risk set, because an individual contributes a record to a period if, and only if, he or she is at risk of event occurrence in that period. The column labeled EVENT = 1 gives the number of people experiencing the event in the jth period, because the variable EVENT takes on the value 1 only in the particular time period when the individual experiences the event; in all other time periods, EVENT must be 0 (as shown in the adjoining column). Given that the table provides period-by-period information about the size of the risk set and the number of people who experienced the event, it should come as no surprise that the row proportion (shown in the last column) estimates hazard, as shown in equation 10.2. Taken together, then, the cross-tabulation provides all the information necessary for constructing the life table.

(p. 355 ) Table 10.3: Cross-tabulation of event indicator (EVENT) and time-period indicator (PERIOD) in the person-period data set to yield components of the life table

| PERIOD | EVENT = 0 | EVENT = 1 | Total  | Proportion EVENT = 1 |
| ------ | --------- | --------- | ------ | -------------------- |
| 1      | 3,485     | 456       | 3,941  | 0.1157               |
| 2      | 3,101     | 384       | 3,485  | 0.1102               |
| 3      | 2,742     | 359       | 3,101  | 0.1158               |
| 4      | 2,447     | 295       | 2,742  | 0.1076               |
| 5      | 2,229     | 218       | 2,447  | 0.0891               |
| 6      | 2,045     | 184       | 2,229  | 0.0825               |
| 7      | 1,922     | 123       | 2,045  | 0.0601               |
| 8      | 1,563     | 79        | 1,642  | 0.0481               |
| 9      | 1,203     | 53        | 1,256  | 0.0422               |
| 10     | 913       | 35        | 948    | 0.0369               |
| 11     | 632       | 16        | 648    | 0.0247               |
| 12     | 386       | 5         | 391    | 0.0128               |
| Total  | 22,668    | 2,207     | 24,875 |                      |
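To make the estimation step concrete, the hazard estimates in the last column of table 10.3 can be recomputed from the EVENT = 1 and Total columns. A minimal Python sketch (counts transcribed from the table; the survivor-function step is included only as an illustration of the discrete-time relationship between hazard and survival):

```python
# Per-period counts from table 10.3: (EVENT = 1, Total).
counts = [(456, 3941), (384, 3485), (359, 3101), (295, 2742),
          (218, 2447), (184, 2229), (123, 2045), (79, 1642),
          (53, 1256), (35, 948), (16, 648), (5, 391)]

# The row proportion estimates hazard: events in period j / number at risk in period j.
hazard = [events / total for events, total in counts]
print([round(h, 4) for h in hazard])
# [0.1157, 0.1102, 0.1158, 0.1076, 0.0891, 0.0825,
#  0.0601, 0.0481, 0.0422, 0.0369, 0.0247, 0.0128]

# The survivor function follows by accumulating (1 - hazard) across periods.
survivor, s = [], 1.0
for h in hazard:
    s *= 1 - h
    survivor.append(s)
```

The printed proportions match the last column of table 10.3 exactly, which is the numerical verification against table 10.1 suggested above.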

Notice how this cross-tabulation reflects the effects of censoring. In each of the first six time periods, when no data are censored, the number of individuals for whom EVENT = 0 (the number not experiencing the event) is identical to the number of individuals at risk of event occurrence in the next time period. For example, because 2742 teachers did not leave teaching in year 3, these same 2742 teachers were eligible to leave teaching in year 4. From year 7 on, however, this equivalence no longer holds: the number of individuals at risk of event occurrence in each subsequent year is smaller than the number who did not experience the event in the previous year. Why? The answer reflects the effects of censoring. The discrepancy between the number of individuals not experiencing the event in the jth period and the number at risk in the (j + 1)st period is the number of individuals censored at the end of the jth time period. For example, because 1563 teachers did not leave teaching in year 8 but only 1256 were at risk of leaving in year 9, we know that (1563 – 1256) = 307 were censored at the end of year 8 (as shown in table 10.1).
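This censoring arithmetic can be checked mechanically: in each period, the number censored is the EVENT = 0 count minus the next period’s risk set, with everyone who remains at the end of year 12 censored there. A Python sketch using the counts from table 10.3:

```python
# EVENT = 0 counts and risk-set sizes (Total) for years 1-12, from table 10.3.
event0 = [3485, 3101, 2742, 2447, 2229, 2045, 1922, 1563, 1203, 913, 632, 386]
at_risk = [3941, 3485, 3101, 2742, 2447, 2229, 2045, 1642, 1256, 948, 648, 391]

censored = []
for j in range(12):
    if j < 11:
        # Those with EVENT = 0 who do not reappear in the next risk set were censored.
        censored.append(event0[j] - at_risk[j + 1])
    else:
        # Everyone still event-free at the end of the final year is censored.
        censored.append(event0[j])

print(censored)        # [0, 0, 0, 0, 0, 0, 280, 307, 255, 265, 241, 386]
print(sum(censored))   # 1734 censored teachers in all
```

The total of 1734 matches the number of censored event times cited earlier, and the per-period counts reproduce the censoring column of table 10.1.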

(p. 356 ) The ability to construct a life table using the person-period data set provides a simple strategy for conducting the descriptive analyses outlined in this chapter. This strategy yields appropriate statistics regardless of the amount, or pattern, of censoring. Perhaps even more important, the person-period data set is the fundamental tool for fitting discrete-time hazard models to data, using methods that we describe in the next chapter.

## Notes:

(1.) Strictly speaking, hazard is the conditional probability of event occurrence per unit of time. Because attention here is restricted to the case of discrete-time, where the unit of time is an “interval” assumed to be of length 1, we omit the temporal qualifier. When we move to continuous time, in chapters 13 through 15, we invoke a temporal qualifier (as well as an altered definition of hazard).

(2.) Owing to its genesis in modeling human lifetimes, the hazard function is also known as the conditional death density function, and its realizations at any given time are known as the force of mortality (Gross & Clark, 1975). A comparison of these terms suggests that while the phrase “hazard function” has a (p. 610 ) negative connotation, its valence is far milder than that of the competing alternatives!