## Theodore R. Holford

Print publication date: 2002

Print ISBN-13: 9780195124408

Published to Oxford Scholarship Online: September 2009

DOI: 10.1093/acprof:oso/9780195124408.001.0001



# (p.377) Appendix 5 Theory on Regression Models for Proportions

Source:
Multivariate Methods in Epidemiology
Publisher:
Oxford University Press

The approach for fitting generalized linear models is based on work by Nelder and Wedderburn (1) and McCullagh and Nelder (2), which provides a powerful synthesis of regression methodology, thus unifying various techniques that had previously been considered separately. The elements of a generalized linear model that need to be identified in order to obtain maximum likelihood estimates are the distribution of the response and the relationship between the mean, μ, and the linear predictor, η, which introduces the regressor variables into the analysis. These elements are then used to derive the necessary information for statistical inference on the model parameters.

# Distribution for Binary Responses

As we saw in Chapter 3, the total number of responders, Y, in a group of n individuals with a common probability of the outcome, π, has a binomial distribution in which

(A5.1)
$$\Pr\{Y = y\} = \binom{n}{y}\pi^{y}(1-\pi)^{n-y}, \qquad y = 0, 1, \ldots, n$$

One of the required elements for fitting a generalized linear model is the variance of the response, expressed as a function of the mean, μ = nπ, which in this case is given by
$$\operatorname{Var}(Y) = n\pi(1-\pi) = \mu\left(1 - \frac{\mu}{n}\right)$$
(p.378) The likelihood, ℓ(μ), is given by the probability of observing the response. If we have I independent samples, then
$$\ell(\mu) = \prod_{i=1}^{I}\binom{n_i}{y_i}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}$$
and the log likelihood, ignoring terms that do not involve the parameters, is
$$L = \sum_{i=1}^{I}\left[y_i\log\pi_i + (n_i - y_i)\log(1-\pi_i)\right]$$
Maximum likelihood estimates can be found by solving the normal equations, which are formed by setting the partial derivatives of L with respect to the model parameters to zero. For models belonging to the exponential class, it is convenient to employ the chain rule (2):

(A5.2)
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{I}\frac{\partial L}{\partial \mu_i}\cdot\frac{\partial \mu_i}{\partial \eta_i}\cdot\frac{\partial \eta_i}{\partial \theta}$$

If all of the parameters are regression terms in the linear predictor, then we can conveniently carry the chain rule one step further by noting that
$$\frac{\partial \eta_i}{\partial \beta_r} = x_{ir}$$
where x_{ir} is the rth regressor variable for the ith subject.
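To see how the chain-rule score of equation A5.2 is used in practice, the following is a minimal sketch (not from the original text) of Newton–Raphson fitting for a binomial model with a logit link; the data, and the choice of Python with numpy, are assumptions for illustration. For the canonical logit link the score reduces to X′(y − μ).

```python
import numpy as np

# Illustrative sketch: Newton-Raphson for a binomial-logit model built
# from the chain-rule score of equation A5.2.  Data are invented.
x = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])  # design matrix
n = np.array([50.0, 50.0, 50.0, 50.0])                   # group sizes
y = np.array([5.0, 12.0, 25.0, 38.0])                    # responders

beta = np.zeros(2)
for _ in range(25):
    eta = x @ beta                    # linear predictor
    pi = 1.0 / (1.0 + np.exp(-eta))   # inverse logit link
    mu = n * pi                       # mean response
    score = x.T @ (y - mu)            # canonical link: score = X'(y - mu)
    w = n * pi * (1.0 - pi)           # working weights, dmu/deta
    info = x.T @ (w[:, None] * x)     # information matrix
    beta = beta + np.linalg.solve(info, score)
```

At convergence the score is zero, which is exactly the normal-equation condition described above.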

A final term that is often used when formulating likelihood-based inference for binary responses is the deviance, which is defined as twice the logarithm of the ratio of the likelihood evaluated at the observed response to that evaluated at the expected response. For the ith subject this becomes

$$d_i = 2\left[y_i\log\frac{y_i}{\hat{\mu}_i} + (n_i - y_i)\log\frac{n_i - y_i}{n_i - \hat{\mu}_i}\right]$$

and the overall deviance, D = Σ d_i, is found by summing over the I samples. Maximizing the likelihood is equivalent to minimizing the deviance, but the deviance is especially useful in the present context because when n_i is large, it can be interpreted (p.379) as an overall goodness of fit statistic, which can be compared to a chi-square distribution with I − p df, where p is the number of parameters in our model.
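A short numerical sketch of the per-subject deviance just defined, using invented grouped data and hypothetical fitted means (the function name and data are assumptions for illustration):

```python
import numpy as np

# Illustrative per-group binomial deviance contributions, with the
# usual convention 0*log(0) = 0.  Data and fitted means are invented.
def binomial_deviance(y, n, mu):
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
        t2 = np.where(n - y > 0, (n - y) * np.log((n - y) / (n - mu)), 0.0)
    return 2.0 * (t1 + t2)

y = np.array([5.0, 12.0, 25.0, 38.0])
n = np.array([50.0, 50.0, 50.0, 50.0])
mu = np.array([5.6, 11.2, 24.9, 37.5])  # hypothetical fitted means
D = binomial_deviance(y, n, mu).sum()   # compare to chi-square, I - p df
```

A perfect fit (fitted means equal to the observed counts) gives a deviance of zero.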

# Functions of the Linear Predictor

The relationship between the proportion responding and the linear predictor is defined by the link function, η = g(π), which gives the transformation of the probability of the response that leads to the linear portion of the model. It is also necessary to be able to reverse the process by determining π as a function of the linear predictor, π = g⁻¹(η), called the inverse link. Some software also requires that we specify the derivative of the linear predictor with respect to π,
$$\frac{d\eta}{d\pi} = g'(\pi)$$

Table A5–1 gives some useful link functions along with their corresponding inverses and derivatives. Some of these are commonly included in software as built-in options, including the logit, complementary log-log, and probit.

Example A5–1. Suppose we are interested in fitting a model in which the odds for disease is a linear function of the regression parameters. Let π represent the

Table A5–1. Commonly used link functions, inverse links, and derivatives for binary response data

| Model | Link, η = g(π) | Inverse link, π = g⁻¹(η) | Derivative, dη/dπ |
|---|---|---|---|
| Logit | log[π/(1 − π)] | exp(η)/(1 + exp(η)) | 1/[π(1 − π)] |
| Complementary log-log | log[−log(1 − π)] | 1 − exp[−exp(η)] | −1/[(1 − π)log(1 − π)] |
| Probit | Φ⁻¹(π) | Φ(η) | 1/φ(Φ⁻¹(π)) |
| Linear odds | π/(1 − π) | η/(1 + η) | 1/(1 − π)² |
| Power odds (γ ≠ 0) | {[π/(1 − π)]^γ − 1}/γ | (1 + γη)^{1/γ}/[1 + (1 + γη)^{1/γ}] | [π/(1 − π)]^{γ−1}/(1 − π)² |
| Power odds (γ = 0) | see Logit | see Logit | see Logit |

(p.380) probability that disease occurs, and let the response, Y, represent the number of cases that occur out of n observed in a particular risk group. Hence, the mean response is μ = nπ, and the link function is given by
$$\eta = g(\pi) = \frac{\pi}{1-\pi}$$
This can be rearranged to give the inverse link
$$\pi = g^{-1}(\eta) = \frac{\eta}{1+\eta}$$
and the derivative of the link function is given by
$$\frac{d\eta}{d\pi} = \frac{1}{(1-\pi)^2}$$

# Using Results to Conduct Inference

Typical output from statistical software includes computations involving the functions defined here, which are evaluated at various values of the underlying model parameters. Let us now consider how these elements are typically used in conducting statistical inference.

First, we are given the maximum likelihood estimates of the model parameters, β̂, along with the estimated covariance matrix, calculated as the inverse of the information matrix,
$$\widehat{\operatorname{Var}}(\hat{\beta}) = \mathcal{I}(\hat{\beta})^{-1}$$

and the square roots of the diagonal elements yield the estimated standard errors of the model parameters, SE(β̂_r), for the rth parameter. If we apply the chain rule to the log likelihood arising from equation A5.1, then the information matrix is

(A5.3)
$$\mathcal{I}_{rs} = -E\left[\sum_{i=1}^{I}\left\{\frac{\partial^2 L}{\partial\pi_i^2}\frac{\partial\pi_i}{\partial\beta_r}\frac{\partial\pi_i}{\partial\beta_s} + \frac{\partial L}{\partial\pi_i}\frac{\partial^2\pi_i}{\partial\beta_r\,\partial\beta_s}\right\}\right]$$

The partial derivative of L with respect to π_i has expectation zero, so we can drop the second term in brackets and write the information matrix as
$$\mathcal{I}_{rs} = \sum_{i=1}^{I}\frac{n_i}{\pi_i(1-\pi_i)}\left(\frac{\partial\pi_i}{\partial\eta_i}\right)^2 x_{ir}x_{is}$$
(p.381) Perhaps the most familiar way in which the covariance matrix can be used is in constructing a Wald test,
$$W = \frac{\hat{\beta}_r}{\operatorname{SE}(\hat{\beta}_r)}$$
which is compared to a standard normal deviate, or its square, W², which is compared to a chi-square distribution with 1 df. A generalization of this test can also be constructed if we wish to consider a p* vector of parameters, β*, containing a subset of all parameters in the model,
$$W^2 = \hat{\beta}^{*\prime}\,\widehat{\operatorname{Var}}(\hat{\beta}^{*})^{-1}\,\hat{\beta}^{*}$$
which is compared to a chi-square distribution with p* df.
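Both forms of the Wald statistic can be sketched in a few lines; the estimate and covariance matrix below are invented for illustration (in practice both come from the fitted model).

```python
import numpy as np

# Illustrative Wald statistics with an invented estimate and
# covariance matrix.
beta_hat = np.array([0.8, -0.3])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])

se = np.sqrt(np.diag(cov))                      # standard errors
w = beta_hat / se                               # compare to N(0, 1)
w2 = beta_hat @ np.linalg.solve(cov, beta_hat)  # compare to chi-square, 2 df
```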

An alternative approach for constructing a significance test is to use the log likelihood or the scaled deviance to construct a likelihood ratio test of whether a set of regression parameters has no effect, H₀: β = 0, which yields the fitted mean under the null, μ̂₀. Typically, we are given the log likelihood, so that the test is given by
$$G = 2\left[L(\hat{\mu}) - L(\hat{\mu}_0)\right]$$

An equivalent test can be constructed using the scaled deviance, which is often given as an alternative to the log likelihood:
$$G = D(\hat{\mu}_0) - D(\hat{\mu})$$
Finally, we consider the construction of interval estimates of the model parameters. The simplest approach is to make use of the parameter estimates and their standard errors, so that the 100(1 − α)% confidence interval for the pth parameter is
$$\hat{\beta}_p \pm z_{1-\alpha/2}\operatorname{SE}(\hat{\beta}_p)$$

which is by definition symmetric about the parameter estimate. This approach works well for large samples, in which the distribution of the maximum (p.382) likelihood estimates is well approximated by a normal distribution. We can also see that this approach is similar in spirit to the Wald test, so we might also consider an analogous procedure that reminds us of the likelihood ratio statistic, by finding the values of the parameter that reduce the maximum of the log likelihood by half the corresponding critical value of the chi-square distribution, χ²₁,₁₋α/2. In order to construct such a confidence interval, consider the profile likelihood, in which we fix one or more parameters and maximize the likelihood over the remaining parameters. We can express the resulting linear predictor as
$$\eta_i = \eta_{i,-p} + x_{ip}\beta_p$$
where the subscript −p indicates that the pth regressor variable has been dropped from the linear predictor. Notice that the form for the linear predictor is the same, only now x_{ip}β_p is a specified constant, or offset. The inverse link function yields the mean, μ_p, which, in turn, gives us the profile log likelihood, L*(β_p), and the confidence limits are found by solving
$$2\left[L(\hat{\beta}) - L^{*}(\beta_p)\right] = \chi^2_{1,1-\alpha}$$
for β_p, which typically has two solutions, giving the lower and upper limits.
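As a concrete, simplified instance of this procedure, consider a single proportion, where the profile log likelihood is just the log likelihood itself; the counts below and the use of bisection are illustrative assumptions, and 3.841 is the 95% critical value of chi-square with 1 df.

```python
import math

# Illustrative profile-likelihood confidence interval for the logit of
# a single proportion.  y and n are invented.
y, n = 12, 50

def loglik(b):
    # binomial log likelihood for the logit b (constants dropped)
    return y * b - n * math.log(1.0 + math.exp(b))

b_hat = math.log(y / (n - y))         # closed-form MLE of the logit
target = loglik(b_hat) - 3.841 / 2.0  # drop of half the critical value

def bisect(lo, hi):
    # solve loglik(b) == target on a bracketing interval by bisection
    f_lo = loglik(lo) - target
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = loglik(mid) - target
        if (f_mid < 0.0) == (f_lo < 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lower = bisect(b_hat - 5.0, b_hat)    # the two solutions bracket b_hat
upper = bisect(b_hat, b_hat + 5.0)
```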

A model is always proposed as a possible candidate for describing a set of data; hence, it is important that we be sure that it provides an adequate description of the data. One way in which this can be accomplished is to conduct a goodness of fit test. When the number of subjects in each of the I independent samples is reasonably large, say about 10 or more, then, as we already noted, the deviance can be used for this purpose, comparing it to a chi-square distribution with I − p df, where p is the number of parameters estimated. Alternatively, we can use the Pearson chi-square statistic:
$$X^2 = \sum_{i=1}^{I}\frac{(y_i - n_i\hat{\pi}_i)^2}{n_i\hat{\pi}_i(1-\hat{\pi}_i)}$$
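The Pearson statistic for grouped binomial data can be sketched directly; the counts and fitted probabilities below are invented for illustration.

```python
import numpy as np

# Illustrative Pearson chi-square for grouped binomial data.
y = np.array([5.0, 12.0, 25.0, 38.0])
n = np.array([50.0, 50.0, 50.0, 50.0])
pi_hat = np.array([0.11, 0.23, 0.50, 0.75])  # hypothetical fitted values

x2 = np.sum((y - n * pi_hat) ** 2 / (n * pi_hat * (1.0 - pi_hat)))
df = len(y) - 2                              # I - p, with p = 2 assumed
```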
Example A5–2. One approach we might employ when trying to decide among alternative models for the odds for disease is to consider a general family of regression models that includes both the linear odds model and the linear logistic (p.383) model as special cases. Such a family is one in which a power transformation of the odds for disease is a linear function of the regressor variables,

(A5.4)
$$\eta = \frac{\left[\pi/(1-\pi)\right]^{\gamma} - 1}{\gamma} = \sum_r x_r\beta_r$$

Alternatively, we can express the probability of disease by

(A5.5)
$$\pi = \frac{(1+\gamma\eta)^{1/\gamma}}{1 + (1+\gamma\eta)^{1/\gamma}}$$

If the parameter γ is known, then this reduces to a generalized linear model in which the only parameters to be estimated assess the association between the regressor variables and the outcome. However, we can also regard γ as another parameter to be estimated, so that the model is now a conditionally linear model in which the condition depends on this unknown parameter. We can find the maximum likelihood estimate of γ by fitting the model in equation A5.4 for specified values of γ and then searching for the one that maximizes the likelihood.
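The inverse link of equation A5.5 can be explored numerically. This sketch (illustrative, not from the text) checks that γ = 1 gives the linear-odds form and that small γ approaches the logistic form.

```python
import numpy as np

# Illustrative power-odds inverse link of equation A5.5; gamma = 1
# gives linear odds and gamma -> 0 the logistic limit.
def inv_link(eta, gamma):
    if abs(gamma) < 1e-12:           # gamma = 0: logistic limit
        odds = np.exp(eta)
    else:
        odds = (1.0 + gamma * eta) ** (1.0 / gamma)
    return odds / (1.0 + odds)

eta = 0.7
logistic = np.exp(eta) / (1.0 + np.exp(eta))
```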

While this approach will indeed determine the maximum likelihood estimates of the parameters, the variance estimators provided by the software will not take into account the fact that γ was estimated from the data and is not a known constant. Hence, we will need to correct the covariance matrix for the parameters by considering the resulting information matrix, which can be partitioned as

$$\mathcal{I} = \begin{bmatrix} A & B \\ B' & D \end{bmatrix}$$

where
$$A = -E\left[\frac{\partial^2 L}{\partial\beta\,\partial\beta'}\right], \qquad B = -E\left[\frac{\partial^2 L}{\partial\beta\,\partial\gamma}\right]$$
and
$$D = -E\left[\frac{\partial^2 L}{\partial\gamma^2}\right]$$
(p.384) The last row and column contain the terms involving γ, and A is the information matrix for the regression parameters that are part of the linear predictor. As the software treats the maximum likelihood estimate of γ as fixed, it provides A⁻¹ as an estimate of the covariance matrix. Using the form for the inverse of a partitioned symmetric matrix given by Rao (3), we can write the covariance matrix of all estimated parameters as
$$\begin{bmatrix} A & B \\ B' & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + F E^{-1} F' & -F E^{-1} \\ -E^{-1} F' & E^{-1} \end{bmatrix}$$
where E = D − B′·A⁻¹·B and F = A⁻¹·B. Notice that we must add a correction, F·E⁻¹·F′, to the covariance matrix generated by the software, and as the diagonal elements of this correction are positive, it is clear that if we ignore the fact that γ is estimated, then we will always underestimate the standard errors of our estimates of association with the response.
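The partitioned-inverse correction can be verified numerically: with invented stand-ins for the blocks A, B, and D, the corrected covariance A⁻¹ + F·E⁻¹·F′ matches the upper-left block of the inverse of the full information matrix, and the diagonal correction is positive.

```python
import numpy as np

# Illustrative check of the partitioned-inverse correction; A, B, D
# are invented stand-ins for the information matrix blocks.
A = np.array([[2.0, 0.3],
              [0.3, 1.5]])   # information for the regression parameters
B = np.array([[0.4],
              [0.2]])        # cross term with gamma
D = np.array([[1.2]])        # information for gamma

A_inv = np.linalg.inv(A)
F = A_inv @ B
E = D - B.T @ A_inv @ B
correction = F @ np.linalg.inv(E) @ F.T
cov_beta = A_inv + correction  # corrected covariance of the estimates
```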