# Appendix A Technical Description of the Employed Methodology

The dependent variable of this study (arrest counts) has two properties associated with it that must be appropriately taken into account in any statistical model: (1) it is a non-negative count variable and (2) the data structures contain multiple observations per case (i.e. there is a lack of independence). We deal with each of these issues below.

# Modelling A Count Variable: the Poisson and Negative Binomial Regression Models

Given that the dependent variable is a non-negative count variable, the methods of analysis employed must take into account the discrete nature of this variable.^{1} If we were to apply standard OLS linear regression models that assume a continuous, normally distributed dependent variable as opposed to a skewed count dependent variable such as that used here, it would produce biased, inefficient, and inconsistent estimates of the covariates included in the model specification, as well as possibly predicting a negative number of events (King 1988; Long 1997). For these reasons, two general regression models based on a probability distribution that explicitly takes into account the discrete nature of count variables have been proposed: (1) the Poisson regression model and (2) the negative binomial regression model.

# The Poisson Regression Model

To begin, let $y_{it}$ denote the observed event count of the $i$th individual ($i = 1, \ldots, n$) at time (age) $t$ ($t = 7, 8, 9, \ldots, T_i$), where $T_i = \max(\text{age}_i)$. The univariate Poisson probability distribution function is specified as

$$\Pr(Y_{it} = y_{it}) = \frac{e^{-\lambda}\,\lambda^{y_{it}}}{y_{it}!} \tag{1}$$

where $\Pr(Y_{it} = y_{it})$ indicates the probability that the random variable ($Y_{it}$) takes on the observed value ($y_{it}$) for the $i$th individual at time $t$, and $\lambda$ is the non-negative Poisson parameter representing the mean rate of event occurrence at time $t$ (Land, McCall, and Nagin 1996).^{2}

The Poisson distribution, which is a one-parameter probability distribution, makes a strong assumption regarding the relationship between the mean and variance of the random variable ($Y_{it}$). This assumption, known as the *equidispersion* assumption, assumes that the mean and the variance are equal:

$$E(Y_{it}) = \operatorname{Var}(Y_{it}) = \lambda \tag{2}$$

More generally, the relationship between the mean and the variance of a count variable such as this could be written as

$$\operatorname{Var}(Y_{it}) = \sigma^{2}\,E(Y_{it}) = \sigma^{2}\lambda \tag{3}$$

where $\sigma^{2}$ is the *dispersion parameter*, constrained to be greater than or equal to zero (King 1989). A count variable is said to exhibit equidispersion and to be Poisson-distributed if $\sigma^{2} = 1$, but if $0 \le \sigma^{2} < 1$ then the variable is said to be *underdispersed*, and if $\sigma^{2} > 1$ then the variable is said to be *overdispersed* (King 1989; Land, McCall, and Nagin 1996; Lindsey 1993, 1995). Overdispersion is a very common property among dependent variables utilized in social science data, whereas underdispersion is rare (King 1989).
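The equidispersion idea can be checked numerically. The following is a minimal sketch, not the authors' procedure: it evaluates the Poisson probability function and computes the ratio of the sample variance to the sample mean, which should be close to 1 for Poisson-distributed counts. The function names and the arrest counts are hypothetical and invented purely for illustration.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability Pr(Y = y) for a mean rate lam > 0, computed on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; roughly 1 under equidispersion."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

# Hypothetical arrest counts (not data from the study).
counts = [0, 0, 0, 1, 0, 2, 0, 0, 7, 0, 1, 0, 0, 4, 0]
ratio = dispersion_ratio(counts)
print(f"variance / mean = {ratio:.2f}")
```

A ratio well above 1, as in this toy sample, is the informal signature of overdispersion; a formal test is discussed later in the appendix.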

The important point to make here is that overdispersion implies a significant substantive fact critically relevant to this study: there is unexplained variation in accounting for why some subjects have greater or fewer total arrests (events) than do other subjects. Stated differently, there is more heterogeneity in the mean event rate *among the individuals* than would be expected according to the Poisson distribution. One possible way to account for why some individuals have a higher mean arrest rate than others is to specify a Poisson regression model whereby a set of measured covariates are included through the equality

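$$\lambda_{it} = \exp(\mathbf{X}_{it}\boldsymbol{\beta}) \tag{4}$$

or equivalently

$$\ln(\lambda_{it}) = \mathbf{X}_{it}\boldsymbol{\beta} \tag{5}$$

where $\mathbf{X}_{it}$ is a matrix of measured covariates on individual $i$ at time $t$, and $\boldsymbol{\beta}$ is a column vector of regression coefficients relating the covariates to the mean arrest rate.^{3}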
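To make the log-linear specification in equations (4) and (5) concrete, here is a minimal sketch of a Poisson regression fitted by Newton-Raphson. It is an illustration under stated assumptions rather than the estimation routine used in the study: the helper names are hypothetical, the model is restricted to an intercept and a single centred covariate, and the data are invented.

```python
import math

def poisson_loglik(beta, X, y):
    """Sum of log Poisson probabilities with lambda_it = exp(x_it . beta)."""
    total = 0.0
    for row, count in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        total += count * eta - math.exp(eta) - math.lgamma(count + 1)
    return total

def fit_poisson_two_param(X, y, iterations=25):
    """Newton-Raphson for a model with exactly two columns (intercept, one covariate)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only solution
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for (x0, x1), count in zip(X, y):
            lam = math.exp(b0 * x0 + b1 * x1)
            g0 += (count - lam) * x0      # score for the intercept
            g1 += (count - lam) * x1      # score for the slope
            h00 += lam * x0 * x0          # entries of the information matrix
            h01 += lam * x0 * x1
            h11 += lam * x1 * x1
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: a constant column plus a centred age covariate.
X = [(1.0, float(a)) for a in (-2, -1, 0, 1, 2)] * 2
y = [0, 1, 1, 2, 4, 1, 0, 2, 3, 5]
b0, b1 = fit_poisson_two_param(X, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Because the rate is modelled as the exponential of the linear predictor, every fitted rate is positive, which is what rules out the negative predicted counts that OLS can produce.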

According to Land, McCall, and Nagin (1996), inclusion of measured covariates in the model specification now leads to a conditional expectation function whereby the expected mean and variance of the event count are conditional on the **X** matrix such that

$$E(Y_{it} \mid \mathbf{X}_{it}) = \operatorname{Var}(Y_{it} \mid \mathbf{X}_{it}) = \lambda_{it} = \exp(\mathbf{X}_{it}\boldsymbol{\beta}) \tag{6}$$

Similar to the deterministic relationship stated above in equation (2), equation (6) still implies a deterministic relationship, only now it is a conditional deterministic relationship (conditional on the measured covariates), whereas before it was an unconditional deterministic relationship (Hausman, Hall, and Griliches 1984).^{4} However, conditional on the observed covariates, the observed relationship is still nonstochastic.

As noted by Land, McCall, and Nagin (1996), the equidispersion assumption of the Poisson regression model is an unrealistic expectation in many social science data sets, and furthermore the failure to satisfy the equidispersion assumption leads to underestimated standard errors and inflated t-ratio tests of significance (see also Dean 1998; Hardin and Hilbe 2001). In other words, applying a Poisson regression model to data that cannot satisfy the equidispersion assumption can cause a covariate to appear to be a significant predictor of the outcome variable, when in fact it is not. In that situation, the statistical significance is spurious, due to the consequences of overdispersion. For this reason, methods that ‘scale’ the relationship between the mean and variance were sought (Hardin and Hilbe 2001). The primary method that is used in the presence of significant overdispersion in a Poisson regression model is to estimate a negative binomial regression model.

# The Negative Binomial Regression Model

As stated in Chapter 3, it is unrealistic to assume that every factor related to a dependent variable will be measured and included in all data sets, and thus there will always be some inherent variability (e.g. overdispersion) in the event counts between individuals that must be accounted for (Lindsey 1993, 1995; Land, McCall, and Nagin 1996). In the absence of the measured covariates that explain the discrepancy, this variation is usually accounted for as *stochastic* or *random variation* in the dependent variable. Indeed, the precise reasoning behind fitting a negative binomial regression model (instead of a Poisson regression model) is to include stochastic variation in the event count (Hausman, Hall, and Griliches 1984; Cameron and Trivedi 1986; Land, McCall, and Nagin 1996).^{5} The negative binomial regression model is specified as

$$\ln(\lambda_{it}) = \mathbf{X}_{it}\boldsymbol{\beta} + \varepsilon_{it} \tag{7}$$

or

$$\lambda_{it} = \exp(\mathbf{X}_{it}\boldsymbol{\beta} + \varepsilon_{it}) \tag{8}$$

where $\exp(\varepsilon_{it})$ is distributed as Γ(1, α). The α parameter is known as the dispersion parameter and plays a defining role in scaling the relationship between the mean and the variance as shown in equation (9) below. The inclusion of the gamma distributed error term allows for unexplained variation in $\ln(\lambda_{it})$ (Land, McCall, and Nagin 1996). This unexplained variation can be thought of as having been produced in one of two ways: (1) through the effects of an omitted exogenous variable(s) (Gourieroux, Monfort, and Trognon 1984*a*, 1984*b*) or (2) through inherent stochastic variation (Hausman, Hall, and Griliches 1984). The negative binomial model is known in the statistical literature as a *parametric mixed Poisson* regression model because a parametric mixing distribution (i.e. the gamma distribution) has been incorporated into the Poisson model.

There are two general formulations of the negative binomial model (Cameron and Trivedi 1986, 1998; Hardin and Hilbe 2001): (1) NB1 that specifies a linear mean-variance relationship; and (2) NB2 that specifies a quadratic mean-variance relationship. These models both consider the variance as a function of the mean (or Poisson parameter) such that

$$\operatorname{Var}(Y_{it} \mid \mathbf{X}_{it}) = \lambda_{it} + \alpha\,\lambda_{it}^{\,p} \tag{9}$$

where $p$ is equal to 1 in the NB1 formulation of the negative binomial models and $p$ is equal to 2 in the NB2 formulation. The NB1 and NB2 models are both Poisson-gamma mixture models, but each model provides a different specification of the mean-variance relationship. The log-likelihood functions for the NB1 and NB2 models are presented in Cameron and Trivedi (1998) and Hardin and Hilbe (2001).

NB2 is the more commonly used negative binomial model because it is the model that was first programmed into software packages such as LIMDEP (Greene 1998) and Stata (StataCorp. 2001) that were commonly used to model count data. As demonstrated by Land, McCall, and Nagin (1996), under the assumption that $\exp(\varepsilon_{it})$ is distributed as Γ(1, α) and that $\varepsilon_{it}$ is independent of **X**, the probability of observing the count $y_{it}$ for the $i$th individual at time $t$ in the NB2 model is:

$$\Pr(Y_{it} = y_{it} \mid \mathbf{X}_{it}) = \frac{\Gamma(y_{it} + \nu)}{y_{it}!\;\Gamma(\nu)}\; u_{it}^{\,\nu}\,(1 - u_{it})^{y_{it}} \tag{10}$$

where

$$u_{it} = \frac{\nu}{\nu + \lambda_{it}} \tag{11}$$

$\lambda_{it} = \exp(\mathbf{X}_{it}\boldsymbol{\beta})$, Γ(·) is the gamma function, ν = 1/α, and α ≥ 0 (see also Cameron and Trivedi 1986). Estimates of α and $\boldsymbol{\beta}$ are obtained using maximum likelihood methods (Hausman, Hall, and Griliches 1984; Cameron and Trivedi 1998; Land, McCall, and Nagin 1996). Under this specification, the mean and variance are (Cameron and Trivedi 1986, 1998; Land, McCall, and Nagin 1996):

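$$E(Y_{it} \mid \mathbf{X}_{it}) = \lambda_{it} \tag{12}$$

and

$$\operatorname{Var}(Y_{it} \mid \mathbf{X}_{it}) = \lambda_{it} + \alpha\,\lambda_{it}^{2} = \lambda_{it}\,(1 + \alpha\,\lambda_{it}) \tag{13}$$

Thus, the expected number of events is still equal to $\lambda_{it}$ (or $\exp(\mathbf{X}_{it}\boldsymbol{\beta})$), but the variance is no longer constrained to be equal to the mean; there is now a quadratic relationship between the mean and variance.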

Under the NB1 model, a change is made to equation (11): ν = λ/α instead of ν = 1/α (Cameron and Trivedi 1998; Long 1997). The mean and variance under the NB1 specification are:

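$$E(Y_{it} \mid \mathbf{X}_{it}) = \lambda_{it} \tag{14}$$

and

$$\operatorname{Var}(Y_{it} \mid \mathbf{X}_{it}) = \lambda_{it} + \alpha\,\lambda_{it} = \lambda_{it}\,(1 + \alpha) \tag{15}$$

Thus, the expected number of events is still equal to $\lambda_{it}$, but the variance is now linearly related to the mean through the dispersion parameter.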
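The two mean-variance relationships can be verified numerically. The sketch below is an illustration under stated assumptions, not code from the study: it writes the Poisson-gamma mixture probability with ν = 1/α for NB2 and ν = λ/α for NB1 (both expressed through a hypothetical argument `p`), then sums over a long range of counts to recover the mean and variance. The function names, the rate of 2.0, and the dispersion value of 0.5 are all invented for the example.

```python
import math

def negbin_pmf(y, lam, alpha, p=2):
    """Poisson-gamma mixture probability whose variance is lam + alpha * lam**p (p = 1 or 2)."""
    nu = lam ** (2 - p) / alpha  # NB2: nu = 1/alpha; NB1: nu = lam/alpha
    log_prob = (math.lgamma(y + nu) - math.lgamma(nu) - math.lgamma(y + 1)
                + nu * math.log(nu / (nu + lam))
                + y * math.log(lam / (nu + lam)))
    return math.exp(log_prob)

def moments(lam, alpha, p, upper=2000):
    """Mean and variance obtained by summing the probabilities over 0 .. upper - 1."""
    probs = [negbin_pmf(y, lam, alpha, p) for y in range(upper)]
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return mean, var

mean_nb2, var_nb2 = moments(lam=2.0, alpha=0.5, p=2)
mean_nb1, var_nb1 = moments(lam=2.0, alpha=0.5, p=1)
print(f"NB2: mean = {mean_nb2:.4f}, variance = {var_nb2:.4f}")
print(f"NB1: mean = {mean_nb1:.4f}, variance = {var_nb1:.4f}")
```

Both formulations leave the mean at the Poisson rate; only the variance changes, quadratically in the rate for NB2 and linearly for NB1.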

Cameron and Trivedi (1998) recommend choosing between the NB1 and NB2 models on the basis of the log-likelihood values. The model with the larger (less negative) log-likelihood value is favored, since they both are estimated using the same number of parameters.^{6} Substantively, however, the importance of the negative binomial model (whether specified as NB1 or NB2) is that individuals with identical values on the included covariates now have gamma distributed expected event counts, rather than being equal to the same conditional mean rate (as in the Poisson model).

In fact, the Poisson model is nested in the negative binomial model (Cameron and Trivedi 1986, 1998; Hardin and Hilbe 2001; Land, McCall, and Nagin 1996; Long 1997). The boundary or limiting case corresponds to α = 0, under which the negative binomial model becomes the Poisson model. For example, if α = 0 in equation (9), then we arrive back at the initial equidispersion assumption of the Poisson regression model found in equation (6). As Land, McCall, and Nagin (1996: 398) note, ‘this circumstance corresponds to the limiting case where all individuals have the same *λ* _{it} conditional on **X** _{it} which is precisely the assumption of the Poisson regression model.’

Of course, as *α* increases in size, so does the overdispersion of the data (Hardin and Hilbe 2001), and thus testing for the presence of significant overdispersion often becomes a primary task when modelling count data. The standard statistical test for assessing overdispersion involves testing the null hypothesis H_{N}: *α* = 0 against its alternative, H_{A}: *α* > 0. Because the Poisson model is a nested version of the negative binomial model, the test for significant overdispersion is frequently accomplished via a likelihood ratio test that compares the log-likelihood values of the negative binomial regression and Poisson regression models. The likelihood ratio test statistic is calculated as twice the difference in log-likelihood values, and this test statistic has conventionally been treated as distributed χ^{2} with one degree of freedom. While this is how the test for overdispersion has been calculated in the past, currently it is recognized that this form of the likelihood ratio test is incorrect in this particular situation. More specifically, because the dispersion parameter has to be greater than or equal to 0, the null hypothesis sits on the boundary of the parameter space (Self and Liang 1987; Gutierrez, Carter, and Drukker 2001). Because the null hypothesis is on the boundary of the parameter space, a critical regularity condition is violated: ‘the null parameter space is no longer interior to the full parameter space, and thus the result which states that the likelihood ratio test statistic tends towards [chi-square with one degree of freedom] in distribution is untrue’ (Gutierrez, Carter, and Drukker 2001: 16). As shown by Self and Liang (1987), the correct reference distribution is a 50:50 mixture of (1) a chi-square distribution with zero degrees of freedom (i.e. a point mass at zero) and (2) a chi-square distribution with 1 degree of freedom. P-values calculated according to this 50:50 mixture correspond, in fact, to one-half the p-value calculated using only the upper tail area of the chi-square distribution with 1 degree of freedom.^{7}
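The boundary-corrected test can be sketched in a few lines. This is a minimal illustration, assuming the two log-likelihood values have already been obtained from fitted Poisson and negative binomial models; the function names and the numerical log-likelihoods are hypothetical. The upper tail of a chi-square distribution with 1 degree of freedom is computed from the complementary error function, since such a variable is the square of a standard normal.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def overdispersion_lr_test(loglik_poisson, loglik_negbin):
    """Likelihood ratio test of alpha = 0 with the 50:50 mixture (boundary) correction."""
    lr = max(0.0, 2.0 * (loglik_negbin - loglik_poisson))
    naive_p = chi2_sf_1df(lr)                     # conventional chi-square(1) p-value
    boundary_p = 0.5 * naive_p if lr > 0 else 1.0  # half the upper-tail area
    return lr, naive_p, boundary_p

# Hypothetical log-likelihoods, chosen so the statistic sits near the 10% critical value.
lr, naive_p, boundary_p = overdispersion_lr_test(-1250.0, -1248.647228)
print(f"LR = {lr:.3f}, naive p = {naive_p:.3f}, boundary-corrected p = {boundary_p:.3f}")
```

In this example the conventional p-value would not reject at the 5 per cent level, whereas the corrected p-value sits right at that threshold, which shows why the uncorrected test is conservative.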

Land, McCall, and Nagin (1996) note that the negative binomial regression model is a significant generalization of the Poisson model because it accommodates overdispersion in count data while simultaneously keeping the suitable features necessary to model count data. However, Land, McCall, and Nagin (1996) also note that the negative binomial regression model is restrictive because it assumes that the heterogeneity is gamma distributed in the population, which may be an arbitrary assumption. If this assumption is incorrect, the standard errors of the regression coefficients will be spuriously deflated, leading to inflated t-ratios.

More importantly, however, the negative binomial regression model, in its most basic form (as specified in this section), completely ignores the dependence among observations when it is applied to panel data. This is significant because serial dependence among observations is known to be a significant cause of overdispersion in count data (Lindsey 1993, 1995; Winkelmann 1995, 1997; Dean 1998; Pickles 1998). Recall that overdispersion substantively implies that the model fails to account for why some subjects have greater or fewer total arrest events than other subjects, or stated differently, there is more heterogeneity in the outcome variable *among the individuals* than would be expected according to the probability distribution.

In the presence of panel data, the negative binomial model specified here is often referred to as a ‘naïve pooling’ model (Burton, Gurrin, and Sly 1998; Hardin and Hilbe 2001) because it *naively* treats the panel data set as a *pooled* sample consisting of N × T independent individuals rather than as simply N independent individuals each with T (or T_{i} if unbalanced) dependent observations (Hamerle and Ronning 1995). For example, the negative binomial model expressed above treats the extra variation as resulting purely from transient stochastic variation, rather than allowing a main component of the extra variation to be caused *explicitly* by the stochastic dependence between the observations within the N individuals (Cameron and Trivedi 1998; Dean 1998; Lindsey 1993). Within each individual, each ‘draw’ (i.e. for a given ‘age’ record) of the random effect from the gamma distribution is completely independent of the other draws for that individual (i.e. for the other ‘age’ records of that individual). Stated more emphatically, the standard version of the negative binomial model (that ignores the panel structure of the data) does not control for persistent unobserved heterogeneity, or as Hsiao (1986: 158) notes, ‘statistical models developed for analyzing cross-sectional data essentially ignore individual differences.’ The stochastic variation of the event counts for each individual's panel records is viewed as having been generated by chance; there is no serial dependence of the individual's records. A ‘shortcoming of the negative binomial model is that it does not allow for firm [individual] specific effects so that serial correlation of the residuals (i.e., non-independence of the counts) may be a problem’ (Hausman, Hall, and Griliches 1984: 922–3).

This failure to correct for the dependence among the observations is particularly critical for the topic of this study because the ‘naïve pooled’ negative binomial model completely ignores two possible sources of overdispersion in the data—population heterogeneity and state dependence—because it ignores the serial dependence within the data (Cameron and Trivedi 1998; Dean 1998; Hsiao 1986; Lindsey 1993, 1995). For example, if ‘individuals who have experienced an event in the past are more likely to experience the event in the future than are individuals who have not previously experienced the event,’ then this will induce overdispersion in the data (Heckman 1981 *b*: 91). In such cases the key question is whether this overdispersion (which is a consequence of the serial dependence) is the result of a process of contagion/state dependence, population heterogeneity, or possibly both of these processes.^{8} These two sources of overdispersion are the same two explanations discussed in Chapter 2 as fundamentally critical to our study (because they are rival hypotheses in explaining the relationship between past and future criminal activity). As Hamerle and Ronning (1995: 411–12) state, ignoring ‘heterogeneity among cross-sections or time-series units [individuals] could lead to inconsistent or meaningless estimates of the structural parameters … controlling for heterogeneity is in most applications a means to obtain consistent estimates of the systematic part of the model.’

For example, suppose in the pooled negative binomial model one were to find a significant association between a binary indicator of arrest at the previous age and arrest at the current age. In the standard negative binomial model, the process underlying this significant association would be indeterminable because the effects of persistent individual differences are left uncontrolled in this model. Lindsey (1993: 157) pointedly remarks, ‘if a missing variable [underlying criminal propensity] can be assumed constant over all events on a unit, but differs among units, this will yield stochastic dependence among the events on each unit,’ and this missing variable will, in fact, ‘induce an effect identical to apparent contagion’ or state dependence.

It has been shown, however, that the unique structure of panel data can be exploited to investigate the above two critical sources of serial correlation of an outcome variable across waves or periods (Cameron and Trivedi 1998; Hamerle and Ronning 1995; Heckman 1981 *a*; Hsiao 1986; Powers and Xie 2000). For example, in an early study investigating whether population heterogeneity or state dependence was driving overdispersion in accident data, Neyman (1965: 6) noted that the distinction between these two processes would be possible if ‘one has at one's disposal data on accidents incurred by each individual separately for two periods of six months each.’ Thus, with multiple waves or periods of data on a set of individuals, models can be estimated that attempt to disentangle the effects of population heterogeneity from those of state dependence by specifically incorporating sources of ‘hidden’ or unobserved heterogeneity into the model specification.

# Accounting for Serial Dependence: Persistent Individual Differences

In this section, we discuss the two most common methods that are used to control for persistent individual differences in panel data: parametric random effects models and semiparametric random effects models.^{9} Before presenting the technical aspects of each formulation, we first broadly compare the two different methods on the basis of how each model accounts for persistent unobserved heterogeneity.

In the parametric random effects model, the error term is specified to be composed of two components (Hamerle and Ronning 1995; Heckman 1981 *a*; Hsiao 1986; Nagin and Paternoster 1991):

$$\varepsilon_{it} = \alpha_{i} + u_{it} \tag{16}$$

where $\alpha_{i}$ represents a persistent (time-stable) individual-specific component that is assumed to follow a specific parametric distribution and $u_{it}$ is a stochastic component that follows some specified parametric distribution. Thus, the parametric random effects model assumes that the persistent unobserved heterogeneity follows a known, mathematically tractable parametric distribution that is specified by the user.

The semiparametric random effects model, on the other hand, makes no parametric assumption about the distribution of unobserved heterogeneity, but rather this method non-parametrically approximates the distribution of persistent unobserved heterogeneity via a set of discrete ‘points of support’. The method only assumes that the distribution of unobserved heterogeneity can be approximated by a discrete multinomial probability distribution (Heckman and Singer 1984; Land, McCall, and Nagin 1996; Nagin 1999; Nagin and Land 1993).

Graphically speaking, this is an example of the classic approach in statistics of approximating a continuous distribution with a discrete distribution. This is often shown by taking a continuous distribution and graphing that distribution using a histogram (which uses a discrete number of ‘bins’ or bars). Using this analogy, the ‘points of support’ would be the histogram ‘bins’ propping-up the continuous distribution. Nagin and Land (1993), Land, McCall, and Nagin (1996) and Nagin (2004) all contain graphical representations of this process.

The distinction between these two methods can be viewed in the light of the tension in statistics that is ever-present between ‘parametric’ and ‘non-parametric’ methods (Bushway, Brame, and Paternoster 1999). Parametric methods are more restrictive methods, but if the parametric assumption is appropriate *in the population*, then this method of estimation will be more statistically efficient (i.e. it will have less variance from sample to sample). The non-parametric methods are less efficient if the true distribution is a (mathematically tractable) known continuous distribution, but since these methods do not assume that the mixing distribution follows a restrictive mathematical parametric form *a priori*, they can approximate any continuous distribution regardless of its shape.

As Nagin and Tremblay (1999: 1188) note, ‘the cost of approximation is obvious. Approximations are just that—there is a loss of accuracy. Balanced against this are gains in generality and flexibility. Generally we have no empirical or theoretical basis for specifying the distribution of the growth curve parameters [random effects] within the population.’ The choice of a parametric mixing distribution is generally made purely on the grounds that some distributions (e.g. conjugate distributions) make the model more mathematically tractable because they ensure that the marginal density of such models has a closed form solution (Cameron and Trivedi 1998). For example, in the standard (single record per individual) Poisson model, the gamma distribution is the conjugate distribution that allows for a closed form solution to the negative binomial regression model. In some situations the available mathematically tractable mixing distribution makes substantive sense, while in other cases this is unlikely to be true. Indeed, this was the precise reasoning behind the development of finite mixture models: a particular mixture distribution does not have to be used simply because it is mathematically tractable. The discrete mixture methods allow the data to *speak for themselves* with respect to the nature and extent of unobserved heterogeneity.

# Parametric Random Effects Negative Binomial Model

The parametric random effects specification of the negative binomial model was first presented by Hausman, Hall, and Griliches (1984), who specified it as

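$$\ln(\lambda_{it}) = \mathbf{X}_{it}\boldsymbol{\beta} + \alpha_{i} + u_{it} \tag{17}$$

This specification differs from the negative binomial specification of equation (7) in the decomposition of the error term, which in the standard model is specified as $\varepsilon_{it} = u_{it}$. In the random effects formulation of the negative binomial model, the decomposition of the error term results in one component, $\alpha_{i}$, representing the fixed, individual-specific component, and one component, $u_{it}$, representing the transitory stochastic variation. Substantively this model allows for randomness both between individuals and within individuals across time (or age) (Hausman, Hall, and Griliches 1984: 927).^{10}

The random effects negative binomial model yields a negative binomial model for the $i$th group that has constant dispersion within the $i$th group, but the dispersion varies randomly between groups. According to this model,

$$\Pr(Y_{it} = y_{it} \mid \mathbf{X}_{it}, \delta_{i}) = \frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it})\,\Gamma(y_{it} + 1)} \left(\frac{1}{1 + \delta_{i}}\right)^{\lambda_{it}} \left(\frac{\delta_{i}}{1 + \delta_{i}}\right)^{y_{it}} \tag{18}$$

Further, in this model the ratio $1/(1 + \delta_{i})$ is assumed to be randomly distributed according to the beta distribution, with the $r$ and $s$ parameters of the beta distribution, $B(r, s)$, estimated from the data at hand.^{11} This results in the following joint probability of event counts for the $i$th group (Hausman, Hall, and Griliches 1984):

$$\begin{aligned}
\Pr(Y_{i1} = y_{i1}, \ldots, Y_{in_i} = y_{in_i} \mid \mathbf{X}_{i})
&= \int_{0}^{\infty} \prod_{t=1}^{n_i} \Pr(Y_{it} = y_{it} \mid \mathbf{X}_{it}, \delta_{i})\, f(\delta_{i})\, d\delta_{i} \\
&= \frac{\Gamma(r + s)\,\Gamma\!\left(r + \sum_{t=1}^{n_i} \lambda_{it}\right)\Gamma\!\left(s + \sum_{t=1}^{n_i} y_{it}\right)}{\Gamma(r)\,\Gamma(s)\,\Gamma\!\left(r + s + \sum_{t=1}^{n_i} \lambda_{it} + \sum_{t=1}^{n_i} y_{it}\right)} \prod_{t=1}^{n_i} \frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it})\,\Gamma(y_{it} + 1)}
\end{aligned} \tag{19--21}$$

where $f(\delta_{i})$ is the density of $\delta_{i}$ implied by the beta assumption, $n_{i} = T_{i}$ or $\max(t)$ for the $i$th individual, and the log-likelihood for equation (21) is (StataCorp. 2001: 393)

$$\begin{aligned}
\ln L = \sum_{i=1}^{n} \Bigg[ &\ln\Gamma(r + s) + \ln\Gamma\!\left(r + \sum_{t=1}^{n_i} \lambda_{it}\right) + \ln\Gamma\!\left(s + \sum_{t=1}^{n_i} y_{it}\right) - \ln\Gamma(r) - \ln\Gamma(s) \\
&- \ln\Gamma\!\left(r + s + \sum_{t=1}^{n_i} \lambda_{it} + \sum_{t=1}^{n_i} y_{it}\right) + \sum_{t=1}^{n_i} \Big\{ \ln\Gamma(\lambda_{it} + y_{it}) - \ln\Gamma(\lambda_{it}) - \ln\Gamma(y_{it} + 1) \Big\} \Bigg]
\end{aligned} \tag{22}$$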

The beta distribution is a flexible distribution because it has two parameters (Greene 2000), but it should be remembered that the beta distribution is used in the negative binomial random effects model precisely because it produces a mathematically tractable expression that allows the unobserved random effects to be integrated out without encountering serious numerical complications. Instead of assuming that the unobserved heterogeneity is distributed according to the beta distribution, we next discuss the semiparametric formulation whereby a discrete set of non-parametric ‘random effects’ are used to account for unobserved heterogeneity. This method also allows us to investigate the nature of the age-crime relationship within latent classes of serious youthful offenders.
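The closed form that results from integrating out the beta-distributed random effect can be sketched as follows. This is a minimal illustration, assuming the beta-negative binomial form documented for random effects negative binomial models, with hypothetical function and variable names and invented parameter values; it is not the software routine used in the study. It computes one individual's log-likelihood contribution from log-gamma terms and checks that, for a single period, the implied probabilities sum to one.

```python
import math

def re_negbin_loglik_individual(y, lam, r, s):
    """Log joint probability of one individual's counts y given rates lam and beta parameters r, s."""
    sum_lam, sum_y = sum(lam), sum(y)
    # Terms that depend on the individual's totals (the integrated-out random effect).
    ll = (math.lgamma(r + s) + math.lgamma(r + sum_lam) + math.lgamma(s + sum_y)
          - math.lgamma(r) - math.lgamma(s) - math.lgamma(r + s + sum_lam + sum_y))
    # Period-specific terms.
    for y_t, lam_t in zip(y, lam):
        ll += math.lgamma(lam_t + y_t) - math.lgamma(lam_t) - math.lgamma(y_t + 1)
    return ll

# Hypothetical three-period sequence with rates lam_it = exp(x_it . beta) already computed.
example_ll = re_negbin_loglik_individual(y=[0, 2, 1], lam=[0.8, 1.5, 1.1], r=5.0, s=3.0)

# Sanity check: with one period the probabilities over y = 0, 1, 2, ... should sum to about 1.
total_prob = sum(math.exp(re_negbin_loglik_individual([y], [2.0], 5.0, 3.0))
                 for y in range(2000))
print(f"log-likelihood contribution = {example_ll:.4f}, total probability = {total_prob:.6f}")
```

The sample log-likelihood is the sum of these contributions over individuals, and it would be maximised over the regression coefficients together with `r` and `s`.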

# The Semiparametric Mixed Poisson Model

Before describing the technical aspects of the semiparametric mixed Poisson model, we first present a non-technical discussion of the semiparametric mixed Poisson model.^{12} In brief, the model assumes that the distribution of unobserved heterogeneity can be ‘segmented’ into a finite number of discrete groups; each of the groups is internally homogeneous with respect to the nature of the unobserved heterogeneity within the group, but there is significant heterogeneity between the groups. According to Nagin and Land (1993) (see also Land, McCall, and Nagin 1996), the simplest specification of this model is

$$\ln(\lambda_{it}^{j}) = \beta_{0} + \bar{\varepsilon}_{j} + \mathbf{X}_{it}\boldsymbol{\beta} \tag{23}$$

where $\beta_{0}$ is the overall constant of the model, $\bar{\varepsilon}_{j}$ is a constant term that is specific to the $j$th discrete group or latent class ($j = 1, 2, \ldots, J$), $\mathbf{X}_{it}$ is a matrix of measured covariates on individual $i$ at time $t$, and $\boldsymbol{\beta}$ is a column vector of regression coefficients.^{13}

Cameron and Trivedi (1998: 129) refer to this model as a *random intercept model* because each latent class has a separate constant or intercept parameter assigned to it. The effects of the regression coefficients are constrained to be equal across the groups, and the latent classes differ only with respect to their ‘location parameter’. Nagin and Land (1993) describe this model as producing ‘constant shifts’ in the mean rate. That is, the trajectories of each class are identical in shape, but they differ in the mean location of the trajectory. It bears noting that this is the precise specification that corresponds to Gottfredson and Hirschi's hypothesis concerning the robust nature of the relationship between age and crime: groups differ on their mean offending rate, but the actual shapes of the trajectories are identical.

The bare essence of this finite mixture model is that the finite number of intercept coefficients (there is a separate intercept coefficient for each group or ‘point of support’) represent the ‘average contribution’ of persistent unobserved heterogeneity to the expected Poisson rate for ‘individuals possessing levels of unobserved heterogeneity in the immediate vicinity of the *j* ^{th} point of support’ (Nagin and Land 1993: 338). This model was subsequently referred to as semiparametric in nature because it ‘combines a parametric specification of the regression component of the model with a non-parametric specification of the error term’ (Land and Nagin 1996: 170).

Nagin and Land (1993) also present a more general form of the mixture model

$$\ln(\lambda_{it}^{j}) = \beta_{0} + \bar{\varepsilon}_{j} + \mathbf{X}_{it}\boldsymbol{\beta}^{j} \tag{24}$$

where $\boldsymbol{\beta}^{j}$ is a vector of group-specific regression coefficients. By permitting the regression coefficients to vary across the latent classes, this model allows for full heterogeneity not just in the ‘location’, but also in the nature of each latent class's offending trajectory over time. It is also possible for some of the regression coefficients to be constrained so that they are equal across the latent classes, while simultaneously allowing other coefficients to vary across the latent classes. Wedel *et al*. (1993) refer to this type of model as a model with random effects in both the intercept and slope parameters. For example, consider the case where the **X** matrix contains two variables: age and age-squared. By permitting the age coefficients (i.e. growth parameters) to vary across the latent classes, this specification of the mixed model allows for heterogeneity not only in the mean event rate at a given time $t$, but also in the developmental shape of the trajectories (Nagin 1999; Nagin and Tremblay 1999).

Before concluding this discussion of these finite mixture models, it may be helpful to discuss briefly the more technical side of these models.^{14} Let us begin by denoting individual *i's* longitudinal offending sequence as the vector Y_{i} = [*y* _{i1}, *y* _{i2}, …, *y* _{iTi}], and allow *m* _{j} to denote a random variable indicating the proportion of the data set estimated to belong to the *j* ^{th} point of support. The random variable m_{j} is postulated as a draw from a ‘super-population’—the super population is an additive ‘mixture’ of *J* discrete populations (Cameron and Trivedi 1998: 128). It is important to note that all of the model parameters in the finite mixture model are jointly estimated, including the proportion of the data set that is estimated to belong to the *j* ^{th} point of support. The estimated proportion belonging to each latent class is calculated using the logit function

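$$m_{j} = \frac{\exp(\theta_{j})}{\sum_{k=1}^{J} \exp(\theta_{k})} \tag{25}$$

where the $\theta_{j}$ are freely estimated logit parameters and the following constraints are imposed: $m_{j} \ge 0$ and $\sum_{j} m_{j} = 1$. The probability of observing $y_{it}$ events for individual $i$ in time period $t$ in group $j$ is

$$\Pr(Y_{it} = y_{it} \mid j) = f(y_{it}; \lambda_{it}^{j}) = \frac{e^{-\lambda_{it}^{j}}\,(\lambda_{it}^{j})^{y_{it}}}{y_{it}!} \tag{26}$$

where $f(\cdot)$ is the Poisson density function, and the probability of the entire sequence of individual $i$ at the $j$th point of support is (Land *et al*. 1996)

$$\Pr(Y_{i} \mid j) = \prod_{t=1}^{T_i} f(y_{it}; \lambda_{it}^{j}) \tag{27}$$

Now the unconditional probability of observing individual $i$'s longitudinal sequence can be calculated by aggregating the likelihood function (i.e. aggregating the conditional likelihoods) for individual $i$ over the $J$ points of support according to

$$\Pr(Y_{i}) = \sum_{j=1}^{J} m_{j}\,\Pr(Y_{i} \mid j) \tag{28}$$

which is simply the probability of $Y_{i}$ at each of the $J$ points of support multiplied by the proportion of the population estimated at that point of support and then summed over the $J$ points of support (Land, McCall, and Nagin 1996). The log-likelihood function of this model is

$$\ln L = \sum_{i=1}^{n} \ln\!\left[\, \sum_{j=1}^{J} m_{j} \prod_{t=1}^{T_i} f(y_{it}; \lambda_{it}^{j}) \right] \tag{29}$$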
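The construction of the mixture likelihood described above can be sketched in code. The following is a minimal illustration under stated assumptions, not the estimation software used in the study: the mixing proportions are obtained from unconstrained parameters through a multinomial logit (the name `theta` is hypothetical), each class supplies its own sequence of Poisson rates, and the sum over classes is computed with the log-sum-exp device for numerical stability. The data and rates are invented.

```python
import math

def mixing_proportions(theta):
    """Multinomial-logit transform: non-negative proportions that sum to one."""
    top = max(theta)
    weights = [math.exp(v - top) for v in theta]
    total = sum(weights)
    return [w / total for w in weights]

def poisson_logpmf(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def individual_loglik(y_seq, rates_by_class, proportions):
    """log of sum_j m_j * prod_t f(y_it; lambda_jt), evaluated via log-sum-exp."""
    terms = []
    for m_j, rates in zip(proportions, rates_by_class):
        seq_ll = sum(poisson_logpmf(y, lam) for y, lam in zip(y_seq, rates))
        terms.append(math.log(m_j) + seq_ll)
    top = max(terms)
    return top + math.log(sum(math.exp(v - top) for v in terms))

def mixture_loglik(data, rates_by_class, proportions):
    """Sample log-likelihood: the sum of individual contributions."""
    return sum(individual_loglik(y_seq, rates_by_class, proportions) for y_seq in data)

# Hypothetical sequences for three individuals and two latent classes.
data = [[0, 0, 1, 0], [2, 3, 1, 4], [0, 1, 0, 0]]
rates_by_class = [[0.3] * 4, [2.5] * 4]   # a low-rate class and a high-rate class
proportions = mixing_proportions([0.0, -1.0])
loglik = mixture_loglik(data, rates_by_class, proportions)
print(f"proportions = {proportions}, log-likelihood = {loglik:.4f}")
```

In a full model the class-specific rates would themselves be functions of the covariates and class-specific coefficients, and all parameters would be estimated jointly by maximising this function.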

Of course, a key issue related to the finite mixture model concerns the number of points of supports/latent classes/groups to include in the mixture (D'Unger *et al*. 1998).^{15} In other words, how does one decide how many points of support to include in the model? Unfortunately, a finite mixture model with *J* = 2 mixture components is not nested in the model with *J* = 3 components, and therefore a likelihood ratio test statistic cannot be used to determine the number of components in the mixture distribution ‘because there is not a unique way of obtaining the null hypothesis from the alternative hypothesis’ (D'Unger *et al*. 1998; Ghosh and Sen 1985; Land, McCall, and Nagin 1996: 424; Nagin 1999; Titterington, Smith, and Makov 1985). More specifically, one is prevented from calculating the appropriate degrees of freedom for the likelihood ratio test.

Given this problem with the likelihood ratio test, alternative methods of determining the number of mixture components have been recommended. For example, D'Unger *et al*. (1998) propose the use of the Bayesian Information Criterion (BIC) statistic as a basis of selecting the appropriate number of groups in the mixing distribution (see also Nagin 1999). The BIC statistic is calculated as:

$$
\mathrm{BIC} = \log(L) - 0.5\,k\,\log(N) \tag{30}
$$

where log(*L*) is the model log-likelihood value, *N* is the sample size, and *k* is the number of estimated parameters. The BIC statistic rewards parsimony because each additional fitted parameter results in a ‘penalty’ proportional to the log of the sample size (Kass and Raftery 1995; Raftery 1995; Nagin 1999). In short, the BIC statistic follows the principle of ‘Occam's Razor’ and values parsimony; nonetheless, the substantive goal (p.285) of the BIC statistic is the same as that of the likelihood ratio statistic: to find the best model with the fewest parameters. The model that has the least negative value of the BIC statistic is the favored model (Nagin 1999).
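As a small illustration of how the BIC comparison works in practice, the sketch below scores three candidate models and keeps the one with the least negative value. It assumes the form of the statistic given in Nagin (1999), log(*L*) minus half the number of parameters times the log of the sample size. The log-likelihoods, parameter counts, and sample size are hypothetical numbers chosen for the example, not results from this study.

```python
import math

def bic(log_l, n_params, n_obs):
    """BIC = log(L) - 0.5 * k * log(N); values closer to zero are preferred."""
    return log_l - 0.5 * n_params * math.log(n_obs)

# Hypothetical fits for models with 2, 3, and 4 latent classes.
candidates = {
    2: {"log_l": -5210.4, "k": 5},
    3: {"log_l": -5150.9, "k": 8},
    4: {"log_l": -5146.2, "k": 11},
}
n_obs = 1000  # assumed sample size

scores = {j: bic(c["log_l"], c["k"], n_obs) for j, c in candidates.items()}
best_j = max(scores, key=scores.get)  # least negative BIC is favored
print(scores, best_j)
```

In this made-up comparison the four-class model has the highest log-likelihood, but its gain over the three-class model is smaller than the added penalty for three more parameters, so the three-class model is selected.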

The selection of the optimal number of points of support for the mixing distribution is complicated by the fact that mixture models often have a problem with ‘local solutions’ (i.e. a unique global solution is not always achieved) (Cameron and Trivedi 1998; Goodman 1974; Vermunt and Magidson 2000; Wedel and Kamakura 1998). This issue concerns whether the likelihood function is unimodal or multimodal: it is possible for the model's algorithm to converge to a local maximum, which is not a true global solution. Simulations by Wedel and Kamakura (1998) indicate that the probability of a local solution increases (1) as the number of mixture components increases; (2) as the number of parameters estimated increases (e.g. including a large number of covariates); (3) when the mixture components are not well separated, resulting in *weak identification* of the model (i.e. the model is ‘overparameterized’ and the estimated groups are not that dissimilar); and (4) when using the Poisson and binomial probability distributions. Several authors have recommended testing for the presence of local solutions through the use of multiple sets of starting values (see, e.g., Cameron and Trivedi 1998; Greene 2000; Vermunt and Magidson 2000; Wedel and Kamakura 1998). In this study, we follow the recommendations of D'Unger *et al*. (1998) and Nagin (1999) and guide the selection of the optimum number of mixture components using the BIC statistic. We also undertake extensive testing for the presence of local solutions.
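The idea of checking for local solutions with multiple sets of starting values can be sketched as follows. The example fits a deliberately simplified Poisson mixture (one constant rate per class, no covariates) from several random starts and keeps the run with the highest log-likelihood. The EM algorithm used here is an assumption of the sketch, chosen because it is short to write down; the text does not say which optimizer was used in the study. The data and the function names `e_step` and `fit_em` are likewise hypothetical.

```python
import math
import random

def log_pois(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def e_step(data, m, lam):
    """Posterior class weights for each individual, plus the log-likelihood."""
    weights, log_l = [], 0.0
    for seq in data:
        terms = [math.log(m[j]) + sum(log_pois(y, lam[j]) for y in seq)
                 for j in range(len(m))]
        top = max(terms)
        denom = sum(math.exp(t - top) for t in terms)
        weights.append([math.exp(t - top) / denom for t in terms])
        log_l += top + math.log(denom)
    return weights, log_l

def fit_em(data, n_classes, rng, n_iter=500, tol=1e-9):
    """Run EM from one random set of starting rates."""
    m = [1.0 / n_classes] * n_classes
    lam = [rng.uniform(0.1, 5.0) for _ in range(n_classes)]
    prev = -math.inf
    for _ in range(n_iter):
        w, log_l = e_step(data, m, lam)
        if log_l - prev < tol:  # stop once the improvement is negligible
            break
        prev = log_l
        for j in range(n_classes):
            w_j = [w_i[j] for w_i in w]
            events = sum(x * sum(seq) for x, seq in zip(w_j, data))
            exposure = sum(x * len(seq) for x, seq in zip(w_j, data))
            m[j] = max(sum(w_j) / len(data), 1e-12)  # keep log(m_j) finite
            if exposure > 0:
                lam[j] = max(events / exposure, 1e-8)
    _, log_l = e_step(data, m, lam)  # log-likelihood at the final estimates
    return log_l, m, lam

# Made-up arrest-count sequences for eight individuals over four periods.
data = [[0, 0, 1, 0], [0, 1, 0, 0], [3, 4, 2, 5], [0, 0, 0, 0],
        [4, 3, 5, 2], [1, 0, 0, 1], [2, 4, 3, 3], [0, 0, 1, 1]]

rng = random.Random(0)
runs = [fit_em(data, 2, rng) for _ in range(10)]
best_log_l, best_m, best_lam = max(runs, key=lambda r: r[0])
n_at_best = sum(1 for r in runs if best_log_l - r[0] < 1e-4)
print(best_log_l, best_m, best_lam, n_at_best)
```

If most starts reach the same maximum (`n_at_best` close to the number of runs), that is some reassurance that the solution is not a local one. If the runs scatter across several distinct log-likelihood values, the likelihood surface is probably multimodal and more starts are warranted.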

The finite mixture model is a method of controlling for unobserved heterogeneity, but, depending on the goals of a particular analysis, each individual can also be ‘assigned’ to the latent class to which the individual has the highest probability of belonging. Analyses can then be conducted using either the latent class membership variables (the set of *J* binary variables indicating whether the individual was assigned to the *j* ^{th} latent class) or within each of the ‘latent classes’ (i.e. using only the individuals assigned to a particular latent class) (Nagin and Paternoster 2000). For example, graphs of the offending trajectories depicting the relationship between age and crime within each of the latent classes can then be produced. Another alternative is to include the set of *J* binary variables in regression models as variables that control for persistent unobserved heterogeneity (see Laub *et al*. 1998 for an example).

Thus, one of the key steps in applying finite mixture models is often to sort the individuals in the sample into the latent class to which they have the highest probability of belonging. As shown in Nagin (1999), (p.286) the posterior probability of membership in the *j* ^{th} latent class for individual *i* is calculated as

$$
\widehat{\mathrm{pr}}(j \mid Y_i) = \frac{\widehat{\mathrm{pr}}(Y_i \mid j)\,\hat{m}_j}{\sum_{k=1}^{J} \widehat{\mathrm{pr}}(Y_i \mid k)\,\hat{m}_k} \tag{31}
$$

This probability is computed from the estimated probability (based on the model coefficients) of observing individual *i's* longitudinal sequence (i.e. their longitudinal record of arrest charges), *Y* _{i}, given membership in the *j* ^{th} latent class, and the estimated proportion of the population in latent class *j*. These *J* probabilities are calculated after model estimation based on the maximum likelihood estimates of the model (i.e. on the basis of an *ex post facto* comparison); heuristically, each individual's observed trajectory is compared against the predicted trajectory of each latent class, and they are then assigned to the latent class that is most similar to their observed trajectory. The posterior probabilities provide substantive information on the precision of group assignment, with probabilities larger in magnitude indicating more precision in the assignment of individuals to a particular latent class. The posterior probabilities ‘are among the most useful products of the group-based modeling approach’ (Nagin 1999: 149) because they allow for the assignment of each individual to a particular latent class, which then allows for simple graphical and tabular analyses of the characteristics of the members of the latent classes and their offending behaviors (Nagin 2004).
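The posterior calculation and the assignment rule described above can be illustrated with a short Python sketch. It assumes a Poisson mixture whose class proportions (`m_hat`) and class-specific rates (`lam_hat`) have already been estimated; the values below are hypothetical and stand in for real maximum likelihood estimates.

```python
import math

def log_pois(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def posterior_probs(y_seq, m, lam_by_class):
    """Posterior class probabilities: weight each class-conditional sequence
    probability by the class proportion, then normalize over the classes."""
    terms = [math.log(m_j) + sum(log_pois(y, lam) for y, lam in zip(y_seq, lam_j))
             for m_j, lam_j in zip(m, lam_by_class)]
    top = max(terms)  # normalize on the log scale to avoid underflow
    unnorm = [math.exp(t - top) for t in terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical estimates for a two-class model over three periods.
m_hat = [0.7, 0.3]
lam_hat = [[0.3, 0.4, 0.3], [3.0, 3.5, 2.5]]
data = [[0, 0, 1], [3, 4, 2], [1, 2, 1]]

posteriors = [posterior_probs(seq, m_hat, lam_hat) for seq in data]

# Assign each individual to the class with the highest posterior probability.
assigned = [max(range(len(p)), key=p.__getitem__) for p in posteriors]

# The J binary membership variables (one row per individual).
indicators = [[1 if j == a else 0 for j in range(len(m_hat))] for a in assigned]
print(posteriors, assigned, indicators)
```

With these made-up values the first two sequences are classified with near certainty, while the third sits between the two class profiles, so its largest posterior probability is only a little above one half. That difference is the kind of information on assignment precision the paragraph refers to.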

## Notes:

(1) This discussion of the Poisson model and generalizations of it draws heavily on the detailed treatments of these methods in Hausman, Hall, and Griliches (1984), Cameron and Trivedi (1986, 1998), Hardin and Hilbe (2001), and especially Land, McCall, and Nagin (1996). Excellent discussions of the statistical issues involved in modelling continuity and change are available in Brame, Bushway, and Paternoster (1999) and Bushway, Brame, and Paternoster (1999).

(2) See Appendix A in King (1988) for a mathematical proof that event count data that meet a few modest assumptions about the data generation process can be shown to arise from a Poisson process.

(3) The logarithmic *link function* is used to link the linear systematic component, denoted as **Χβ**, to the response variable (Nelder and Wedderburn 1972; McCullagh and Nelder 1989; Hardin and Hilbe 2001) in order to ensure that the event rate is predicted to be non-negative (Land, McCall, and Nagin 1996).

(4) Hardin and Hilbe (2001: 128) show the mean (first derivative) and variance (second derivative) functions of the Poisson distribution are identical.

(5) The gamma distribution is the ‘conjugate distribution’ for the Poisson distribution, which allows for a closed form solution that is analytically tractable (Lindsey 1995). Assuming the heterogeneity is normally distributed does not lead to an analytically tractable solution (Cameron and Trivedi 1998; Land, McCall, and Nagin 1996).

(6) The negative binomial models presented in Chapter 8 were estimated using both the NB1 and NB2 specifications. The NB1 specifications always had larger log-likelihood values, and thus the NB1 versions are the ones presented in Chapter 8. It should be noted, however, that the NB2 models generated identical substantive conclusions to those reached with the NB1 models.

(8) This issue has been raised not only in studies of criminal behaviour but also in studies of accidents (Bates and Neyman 1952; Greenwood and Yule 1920), unemployment (Heckman 1981 *b*; Heckman and Borjas 1980), bouts of schizophrenia (Kessing, Olsen, and Andersen 1999) and emotional distress (Fischer *et al*. 1984; Robins 1966, 1978), and Medicare claims for Alzheimer's/dementia among the elderly (Taylor, Fillenbaum, and Ezell 2002).

(9) Fixed effects estimators are not included here or employed in this study because fixed effects models with lagged values of the outcome variable prohibit inclusion of time trends or age effects (Bushway, Brame, and Paternoster 1999; Cameron and Trivedi 1998; Hamerle and Ronning 1995). Due to the incredibly strong relationship between age and crime (see Chapter 7), these models are clearly inappropriate for modeling crime data and therefore are not considered here.

(10) The Poisson random effects model for panel data, which generally assumes gamma distributed heterogeneity, only allows for the individual-specific component (which accounts for between-individual differences) (Cameron and Trivedi 1998; Hamerle and Ronning 1995; Hausman, Hall, and Griliches 1984).

(11) This model was estimated using Stata Version 7 (StataCorp. 2001). The distribution of dispersion (noted here) programmed into Stata is the inverse of the Hausman, Hall, and Griliches (1984) method, which is just a technical preference of StataCorp. Regardless of whether *δ* is estimated using the parameterization employed by Hausman, Hall, and Griliches (1984) or the inverse parameterization employed in Stata, the resulting solutions are identical.

(12) It should be noted here that while this discussion centres on the Poisson version of the finite mixture model, the finite mixture model is a general class of models that extends far beyond the formulation of the model on the basis of the Poisson distribution. Finite mixture models can be estimated for any distribution in the exponential or multivariate exponential family (see Nagin 1999; Vermunt and Magidson 2000; and Wedel and Kamakura 1998).

(13) The latent classes or groups are also commonly referred to as the mixture ‘components’ (Cameron and Trivedi 1998) and ‘segments’ (Wedel and Kamakura 1998).