# Prediction of Clinical Outcomes from Genome-wide Data

# Abstract and Keywords

Prognosis is an essential tool in medicine for estimating the likely outcomes of a disease and for preventing it. The traditional approach relies on measuring physiological and environmental parameters. With the recent availability of genome-wide data, it is now possible to incorporate genetic information when predicting complex diseases. Probabilistic graphical models are well known for their effectiveness in prediction tasks and are thus good candidate models in this context. The probabilistic graphical model framework provides many assets: data uncertainty modeling, fast probabilistic inference algorithms, easy incorporation of expert knowledge, and good predictive performance.

*Keywords:*
clinical outcomes, genome-wide data, naive Bayes model, Bayesian model averaging, ROC curve

*Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics.* First Edition. Christine Sinoquet & Raphaël Mourad (Eds). © Oxford University Press 2014. Published in 2014 by Oxford University Press.

Prediction is a key component of clinical care, including individual risk assessment, diagnosis, prognosis, and selection of therapy. Probabilistic models of various types have been developed for making predictions from clinical data. Predictions using genome-wide data have the potential to improve clinical care. This chapter describes a probabilistic inference algorithm for prediction of clinical outcomes from genome-wide data by efficiently averaging over a large number of models. Bayesian model averaging (BMA) is the standard Bayesian approach wherein the prediction is obtained from a weighted average of the predictions of a set of models, with better models influencing the prediction more than others. The model-averaged naive Bayes (MANB) algorithm described here predicts clinical outcomes from genome-wide association study (GWAS) data by averaging over the predictions of all 2^{n} naive Bayes models, weighted by the posterior probability of each model. The MANB approach is then evaluated on a genome-wide Alzheimer’s disease data set and compared with the naive Bayes method.

# 17.1 Introduction

Prediction is a key component of clinical care, including individual risk assessment, diagnosis, prognosis, and selection of therapy. Improvements in predictive performance have the potential to significantly improve patient outcomes and reduce healthcare costs. Probabilistic models of various types have been developed for making predictions from clinical data [6].

Genome-wide patient-specific data are likely to become available to inform clinical care in the foreseeable future and provide significant opportunities for development of statistical models to improve prediction over what is currently possible from clinical data alone. The sheer magnitude of the number of features in genome-wide data (in the hundreds of thousands) presents formidable computational and modeling challenges. This chapter describes a probabilistic algorithm for prediction of clinical outcomes from genome-wide data by efficiently averaging over a large number of models.

The chapter has the following sections: Section 17.2 identifies some of the challenges in learning prediction models from genome-wide data; Section 17.3 provides related background information; Section 17.4 describes an efficient model-averaged naive Bayes (MANB) algorithm for prediction of clinical outcomes from genome-wide data; Section 17.5 provides details about the evaluation protocol; Section 17.6 provides the results from an evaluation of the MANB algorithm on a genome-wide Alzheimer’s disease data set; and Section 17.7 concludes this chapter.

# 17.2 Challenges with Genome-wide Data

The commonest genetic variation in the human genome is the single nucleotide polymorphism (SNP), in which, at a specific location on the genome, only two variants (alleles) are observed across individuals in a population. The human genome is estimated to have about 20–30 million SNPs that constitute approximately 0.1% of the genome. With the development of gene chips that can measure a half-million SNPs or more (e.g., the Affymetrix 500K GeneChip), it is now possible to obtain a snapshot of the variation across the entire human genome.

The availability of gene-chip technology has led to a flurry of GWASs. The typical goal of GWASs has been twofold: (1) to identify SNPs (and corresponding genes) that are associated with a trait or disease and (2) to identify molecular pathways that are involved in disease. Thus, GWASs hope to elucidate the genetic causes and mechanisms that underlie traits or diseases. Fewer studies have investigated how well SNPs of an individual predict clinical outcomes about that individual, such as his or her risk of developing a disease. The focus of this chapter is on learning prediction models from GWAS data; this is in contrast to identifying explanatory models of disease that is the typical focus in GWASs.

Developing prediction models from GWAS data presents several challenges. One challenge is the high dimensionality of the data, which makes feature selection difficult. In addition, such data may have only a few strongly predictive features but many weak ones, and identifying all predictive features makes the problem even more challenging. When a subset of features is strongly predictive of the target outcome, then identifying those features will likely result in excellent prediction. However, when there are few or no strong features and many weak features, the effects of all features may have to be combined to achieve good prediction. One approach to feature selection that can adapt to both of these scenarios is model averaging. In model averaging, the final prediction is obtained by averaging the predictions of a number of models that contain different sets of features. In the scenario when there are several strong features, model averaging will behave like standard feature selection; in the other scenario, when there are many weak features, model averaging will aggregate the predictive effects of these features.

A second challenge is that prediction models used in healthcare should have both good discrimination and good calibration. Discrimination is the ability to correctly separate the target outcome classes, while calibration measures how closely the predicted probabilities agree with the actual outcomes. It is obvious that prediction models should have good discrimination to provide accurate predictions; however, it is less obvious, though equally important, that well-calibrated predictions are needed for making rational decisions that lead to optimal clinical outcomes.

A third challenge is computational tractability. Computationally efficient methods are needed to learn highly predictive and well-calibrated models from high-dimensional GWAS data.

This chapter describes a model-averaged naive Bayes (MANB) algorithm that predicts clinical outcomes from GWAS data by performing Bayesian model averaging over an exponential number of naive Bayes (NB) models. MANB averages over the predictions of every possible NB model with a distinct set of features, weighted by the posterior probability of each model. Compared to NB, MANB addresses all three challenges of GWAS data described above. It detects both strong and weak features, has better calibration, and is computationally as efficient as NB.

# 17.3 Background

This section provides background information about the NB model, Bayesian model averaging and Alzheimer’s disease, as the MANB algorithm will be applied to predict Alzheimer’s disease from GWAS data later on in the chapter.

## 17.3.1 The Naive Bayes Model

The naive Bayes model is a probabilistic model that makes the simplifying assumption that each feature ${X}_{1},{X}_{2},\dots ,{X}_{n}$ in the set *X* is conditionally independent of every other feature given the value of the target variable *T*. Thus, for all values of ${X}_{1},{X}_{2},\dots ,{X}_{n}$ and *T*:

$$\mathbb{P}(X_1, X_2, \dots, X_n \mid T) = \prod_{i=1}^{n} \mathbb{P}(X_i \mid T) \tag{17.1}$$

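Given the prior probability distribution $\mathbb{P}(T)$ and the conditional probability distributions $\mathbb{P}({X}_{i}|T)$, the posterior probability distribution $\mathbb{P}(T|x)$, where *x* is an instantiation of *X*, is obtained by applying Bayes’ theorem:

$$\mathbb{P}(T \mid x) = \frac{\mathbb{P}(T)\prod_{i=1}^{n}\mathbb{P}(x_i \mid T)}{\sum_{t}\mathbb{P}(t)\prod_{i=1}^{n}\mathbb{P}(x_i \mid t)} \tag{17.2}$$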

In equation (17.2), *t* is an instantiation of *T*, and the summation in the denominator is done over all possible values that *T* takes. An example of a small NB model with two features is shown in Fig. 17.1A.
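The Bayes' theorem computation for an NB model can be sketched in a few lines. The function name, the dictionary layout, and the probability values below are hypothetical choices made for illustration; the log-space arithmetic is an implementation convenience rather than part of the model.

```python
import math

def nb_posterior(prior, cond, x):
    """Return P(T = t | x) for each target value t under a naive Bayes model.

    prior: dict mapping target value t -> P(t)
    cond:  list with one dict per feature, mapping (x_i, t) -> P(x_i | t)
    x:     list of observed feature values, one per feature
    """
    # Work in log space so that products over many features do not underflow.
    log_joint = {
        t: math.log(p_t) + sum(math.log(cond[i][(x_i, t)]) for i, x_i in enumerate(x))
        for t, p_t in prior.items()
    }
    # Normalize over all target values (the denominator of Bayes' theorem).
    top = max(log_joint.values())
    unnorm = {t: math.exp(v - top) for t, v in log_joint.items()}
    z = sum(unnorm.values())
    return {t: u / z for t, u in unnorm.items()}

# Hypothetical two-feature model with a binary target (values are made up).
prior = {"present": 0.4, "absent": 0.6}
cond = [
    {(1, "present"): 0.7, (2, "present"): 0.3, (1, "absent"): 0.2, (2, "absent"): 0.8},
    {(1, "present"): 0.5, (2, "present"): 0.5, (1, "absent"): 0.9, (2, "absent"): 0.1},
]
posterior = nb_posterior(prior, cond, [1, 2])
```

For the instance shown, the unnormalized scores are 0.4 × 0.7 × 0.5 = 0.14 for *present* and 0.6 × 0.2 × 0.1 = 0.012 for *absent*, so the posterior for *present* is 0.14 / 0.152.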

The NB model has been used widely for prediction and classification in many domains because (1) it is learned efficiently from data; (2) it is compact, requiring modest amounts of memory; (3) it performs rapid predictions; and (4) it often has good discrimination in practice. The main disadvantage of the NB model is that its predictions are often miscalibrated, and the miscalibration is often accentuated when there are large numbers of features. Due to this problem, an NB model that is learned from high-dimensional GWAS data is likely to make predictions with posterior probabilities that are very close to 0 or 1.
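The drift toward extreme posteriors can be illustrated with a deliberately simplified sketch. It assumes a hypothetical situation in which every feature contributes the same weak likelihood ratio in favor of one class; real GWAS features would not behave this uniformly, but the accumulation mechanism is the same because NB multiplies per-feature evidence as if it were independent.

```python
import math

def nb_posterior_from_ratios(prior, likelihood_ratio, n_features):
    """Posterior P(T = 1 | x) when each of n_features contributes the same likelihood ratio."""
    # NB multiplies per-feature likelihood ratios, so log-odds grow linearly with n_features.
    log_odds = math.log(prior / (1.0 - prior)) + n_features * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical setting: even prior, each feature only weakly informative (ratio 1.1).
for n in (1, 10, 100, 1000):
    print(n, nb_posterior_from_ratios(prior=0.5, likelihood_ratio=1.1, n_features=n))
```

With a single feature the posterior stays close to the prior, whereas with hundreds of such features it becomes numerically indistinguishable from 1.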

## 17.3.2 Bayesian Model Averaging

Typically, methods that learn prediction models from data perform model selection in which a single good model is selected that summarizes the data well. This model is then used to make future predictions. However, given finite data, there is uncertainty in choosing one model to the exclusion of all others, and this can be especially problematic when the selected model is one of several distinct models that all summarize the data more or less equally well. A coherent approach to dealing with the uncertainty in model selection is Bayesian model averaging (BMA). BMA is the standard Bayesian approach wherein the prediction is obtained from a weighted average of the predictions of a set of models, with better models influencing the prediction more than others. There is also a biological rationale for the model-averaging approach. In GWASs, a fundamental problem is genetic heterogeneity due to some mutations affecting only some patients. Model averaging will likely include the effects of such variants that are found only in a subpopulation.

Theoretically, BMA is expected to have better predictive performance than any single model as described by [11]. This result is supported empirically by a wide variety of case studies [12]. For example, [17] applied BMA to select genes from DNA microarray data to predict prognosis in breast cancer, differentiate between two leukemia subtypes, and distinguish among three types of hereditary breast cancer and showed that BMA identified smaller numbers of relevant genes, with comparable prediction accuracy to other methods that identified larger numbers of genes. In addition, [8] provides several clinical case studies in which BMA performed better than various types of model selection. A good overview of BMA in general is provided in [8] and of BMA in the context of Bayesian networks is provided in [10].

## 17.3.3 Alzheimer’s Disease

Alzheimer’s disease is the commonest neurodegenerative disease and the commonest cause of dementia associated with aging [7]. Alzheimer’s disease is characterized by adult onset of progressive dementia that typically begins with subtle memory failure and progresses to a variety of cognitive deficits like confusion, language disturbance, and poor judgment.

Alzheimer’s disease is divided into early-onset familial Alzheimer’s disease, in which the disease begins before 65 years of age, and late-onset Alzheimer’s disease (LOAD), in which it begins at 65 years of age or later [1]. Early-onset Alzheimer’s disease is a rare disease, and its genetic basis is well established. Most cases of early-onset familial Alzheimer’s disease are caused by mutations in one of three genes: amyloid precursor protein gene, presenilin 1, or presenilin 2.

LOAD is the more common form of Alzheimer’s disease and is widespread, affecting almost half of all people over the age of 85 years. LOAD has both genetic and environmental factors, and its genetic basis is more complex than that of early-onset Alzheimer’s disease. Elucidating the role of genetic variants in the pathogenesis and development of LOAD has been a major focus of LOAD GWASs. The APOE SNP rs429358 is the most well-known SNP associated with increased risk of developing LOAD. In addition, in the past several years, GWASs have identified several other genetic loci associated with LOAD. The AlzGene website is a comprehensive and updated resource of Alzheimer’s disease genetic studies and provides an updated list of LOAD-associated SNPs identified by GWASs [2].

# 17.4 The Model-Averaged Naive Bayes (MANB) Algorithm

The model-averaged naive Bayes (MANB) algorithm averages over the predictions of every possible NB model with a distinct set of features, weighted by the posterior probability of each model. The MANB algorithm was initially described by [5] and was later applied to genomic data by [15]. An overview of the algorithm is provided first, followed by a more detailed description.

## 17.4.1 Overview of the MANB Algorithm

Inference with a NB model with features *X* and target *T* consists of deriving $\mathbb{P}(T|x)$ for a test instance with feature values *x* and is given by equation (17.2). The MANB algorithm derives $\mathbb{P}(T|x)$ by model averaging over all 2^{n} NB models, where *n* is the cardinality of *X* (i.e., the total number of features in the data set). For example, in Fig. 17.1, *n* equals 2, and there are 2^{2}, that is, four models over which MANB averages.

The simple example shown in Fig. 17.1 is now used to illustrate Bayesian model averaging. BMA is based on the notion of averaging over a set of possible models and weighting the prediction (inference) of each model according to its probability given training data set *D* where $D=X\cup T$. The model-averaged prediction for a test instance with feature values *x* is given by the following equation, in which the sum ranges over the models $M$ under consideration:

$$\mathbb{P}(T \mid x, D) = \sum_{M} \mathbb{P}(T \mid x, M)\,\mathbb{P}(M \mid D) \tag{17.3}$$

Consider the four NB models on two features *X*_{1} and *X*_{2} in Fig. 17.1, and suppose that, given *D*, the models *a*, *b*, *c*, and *d* are assigned probabilities of 0.5, 0.1, 0.3, and 0.1, respectively. Suppose further that for a test instance with feature values $x=\langle true,false\rangle$, the models *a*, *b*, *c*, and *d* predict *T* = *true* with probabilities 0.9, 0.5, 0.8, and 0.7, respectively. Then, according to equation (17.3), the model-averaged estimate of $\mathbb{P}(true|\langle true,false\rangle)$ is $(0.5\times 0.9)+(0.1\times 0.5)+(0.3\times 0.8)+(0.1\times 0.7)=0.81$.
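The arithmetic of this example can be reproduced directly. The sketch below uses the model probabilities and predictions given in the text; the function and variable names are illustrative only.

```python
# Posterior probabilities of the four models given D (values from the text).
model_posteriors = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.1}
# Each model's predicted probability of T = true for the test instance (values from the text).
model_predictions = {"a": 0.9, "b": 0.5, "c": 0.8, "d": 0.7}

def model_average(posteriors, predictions):
    """Weight each model's prediction by that model's posterior probability."""
    return sum(posteriors[m] * predictions[m] for m in posteriors)

averaged = model_average(model_posteriors, model_predictions)
print(round(averaged, 2))  # 0.81
```

Model *a* dominates the average because it carries half of the posterior mass, which is the sense in which better models influence the prediction more than others.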

As the number of features increases, the number of NB models increases exponentially. For example, for *n* equal to 100, 2^{100} is close to 10^{30}, which is far too many models to average over in an exhaustive way. This implies that it is not feasible to perform model averaging by explicitly performing inference with each NB model and averaging them to obtain $\mathbb{P}(T|x)$ as illustrated in the above example.

Fortunately, the independence relationships inherent in the NB models allow considerably more efficient model averaging. In the case of Bayesian networks, Buntine describes how to use a single conditional probability distribution to compactly represent the model-averaged relationship between a child node and its parent nodes [3]. In the NB model, each feature is a child node with a single parent node, which is the target. Building on Buntine’s compact representation, [5] shows how to perform model averaging efficiently over all NB models on a set of features. With this method, MANB inference becomes linear in the number of features and is as efficient as NB inference. Thus, rather than requiring time on the order of *O*(2^{n}) to perform inference using model averaging, it requires only time on the order of *O*(*n*).
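The speed-up rests on an algebraic identity: when each model's weight and each model's likelihood both factor into one term per feature, the sum over all $2^n$ arc patterns collapses into a product of $n$ two-term sums. The sketch below checks this numerically for a handful of features. The weights and probabilities are hypothetical, and the function names are illustrative rather than taken from [5].

```python
from itertools import product
import math

def brute_force(weights, dep, ind):
    """Sum over all 2^n arc patterns; the running time is exponential in n."""
    n = len(weights)
    total = 0.0
    for included in product([True, False], repeat=n):
        term = 1.0
        for i, arc in enumerate(included):
            # Arc present: weight w_i times the dependent estimate.
            # Arc absent: weight (1 - w_i) times the independent estimate.
            term *= weights[i] * dep[i] if arc else (1.0 - weights[i]) * ind[i]
        total += term
    return total

def factored(weights, dep, ind):
    """The same quantity in O(n): one two-term sum per feature."""
    return math.prod(w * a + (1.0 - w) * b for w, a, b in zip(weights, dep, ind))

# Hypothetical per-feature arc probabilities and conditional probability estimates.
weights = [0.9, 0.2, 0.5]
dep = [0.7, 0.1, 0.4]
ind = [0.3, 0.3, 0.3]
print(brute_force(weights, dep, ind), factored(weights, dep, ind))
```

The two functions agree up to floating-point rounding, but only the factored form remains practical when the number of features reaches the hundreds of thousands.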

The efficiency obtained by MANB inference is based on several assumptions including multinomial features, no missing values in the training data set, parameter independence, and structure modularity. Further details about these assumptions are provided in Appendix 17.A.

## 17.4.2 Details of the MANB Algorithm

This subsection provides a set of equations that describe the MANB algorithm. It first describes the main inference task and then successively decomposes it into its constituent parts.

Let *X* denote a set of *n* discrete-valued features, namely $\{{X}_{1},{X}_{2},\dots ,{X}_{n}\}$, and let *x* denote the values (an instantiation) of the features in *X* in a test instance. Suppose feature *X*_{i} has *r*_{i} possible values that are coded by the integers from 1 to *r*_{i}. Then *r*_{i} is the dimensionality of *X*_{i}. For example, *X*_{i} could represent a SNP that has three genotype values, and those values could be encoded as 1, 2, and 3. Let *x*_{i} be the value of feature *X*_{i} in an instance. Let *T* denote the discrete-valued target variable to be predicted and let *r*_{T} denote the dimensionality of *T*. For example, *T* could represent a disease such as LOAD that exhibits the value *absent* or *present* that could be encoded as 1 and 2. The inference task is to compute a model-averaged posterior probability for each value *t* of *T* conditioned on *x*.

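The following equation is obtained by applying Bayes’ theorem:

$$\mathbb{P}_a(t \mid x) = \frac{\mathbb{P}(t)\,\mathbb{P}_a(x \mid t)}{\sum_{t'}\mathbb{P}(t')\,\mathbb{P}_a(x \mid t')} \tag{17.4}$$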

The subscript *a* in equation (17.4) denotes a model-averaged probability. Assuming that the features are conditionally independent of each other, given the value of the target *T*, the term ${\mathbb{P}}_{a}(x|t)$ is factored as:

$$\mathbb{P}_a(x \mid t) = \prod_{i=1}^{n} \mathbb{P}_a(x_i \mid t) \tag{17.5}$$

Each of the terms in equation (17.5) is estimated using a training data set and prior probabilities that are described below. Let the training data set *D* where $D=X\cup T$ contain *m* instances, and let *D*_{i} denote the part of the data set *D* that contains the values for just feature *X*_{i} and target variable *T*. In [5], it is proven that BMA over all 2^{n} NB models is equivalent to using the following value for each term in equation (17.5):

$$\mathbb{P}_a(x_i \mid t) = \mathbb{P}(x_i \mid t, D_i)\,\mathbb{P}(T \to X_i \mid D_i) + \mathbb{P}(x_i \mid D_i)\,\mathbb{P}(T \dots X_i \mid D_i) \tag{17.6}$$

where $T\to {X}_{i}$ denotes that *T* and *X*_{i} are probabilistically dependent and $T\dots {X}_{i}$ denotes that they are independent. When they are dependent, the conditional probability $\mathbb{P}({x}_{i}|t,{D}_{i})$ is used to estimate ${\mathbb{P}}_{a}({x}_{i}|t)$. When they are independent, $\mathbb{P}({x}_{i}|{D}_{i})$ is used. Equation (17.6) can be viewed as using model averaging (regarding whether a relationship between *T* and *X*_{i} is present or not) to provide smoothing of the probability ${\mathbb{P}}_{a}({x}_{i}|t)$ that is being estimated. This smoothing is in addition to the smoothing that will be done in estimating $\mathbb{P}({x}_{i}|t,{D}_{i})$ and $\mathbb{P}({x}_{i}|{D}_{i})$ (see below), which also appear in equation (17.6).

Once ${\mathbb{P}}_{a}({x}_{i}|t)$ has been derived for each value *x*_{i} (of feature *X*_{i}) and each value *t* (of target *T*), those probabilities can be used in equations (17.5) and (17.4) to calculate the posterior probability distribution of *T* conditioned on any instantiation *x* of *X*.

The derivation of each of the terms in equation (17.6) is described after introducing some notation. Let *N*_{ijk} denote the number of times in *D*_{i} that feature *X*_{i} has the value *k* when target *T* has the value *j*. To keep the notation simple, assume that *x*_{i} equals the value *k* and *t* equals the value *j*. Let ${N}_{ij}=\sum _{k=1}^{{r}_{i}}{N}_{\mathit{\text{ijk}}}$ and let ${N}_{i}=\sum _{j=1}^{{r}_{T}}{N}_{ij}$. Note that for all *i*, *N*_{i} equals *N*, where *N* is the total number of instances in *D*. Finally, let ${N}_{i\ast k}=\sum _{j=1}^{{r}_{T}}{N}_{\mathit{\text{ijk}}}$.

The term $\mathbb{P}({x}_{i}|t,{D}_{i})$ is computed using the Bayesian approach. A uniform prior distribution is specified for $\mathbb{P}({x}_{i}|t)$ which is updated (using the training data) to obtain a posterior distribution for $\mathbb{P}({x}_{i}|t,{D}_{i})$. The mean of the posterior distribution is the Bayesian estimate for the term $\mathbb{P}({X}_{i}=k|T=j,{D}_{i})$ and is given by the following equation:

$$\mathbb{P}(X_i = k \mid T = j, D_i) = \frac{N_{ijk} + 1}{N_{ij} + r_i} \tag{17.7}$$

The prior distribution specified for $\mathbb{P}({x}_{i}|t)$ to derive equation (17.7) is known as the *parameter prior* [4]. Equation (17.7) provides an alternative way to estimate the parameters of a NB model with features *X* and target *T* instead of the commonly used maximum likelihood estimates.

In a manner similar to equation (17.7), the Bayesian estimate for $\mathbb{P}({X}_{i}=k|{D}_{i})$ is given by the following equation:

$$\mathbb{P}(X_i = k \mid D_i) = \frac{N_{i\ast k} + 1}{N_i + r_i} \tag{17.8}$$
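The two Bayesian estimates can be sketched as follows, assuming that the posterior mean under a uniform Dirichlet prior takes its usual add-one form: one pseudo-count per feature value. The function names, the zero-based indexing, and the counts are hypothetical.

```python
def estimate_dependent(counts, j, k):
    """Posterior mean of P(X_i = k | T = j) under a uniform Dirichlet prior.

    counts[j][k] holds N_ijk for one feature X_i (indices are 0-based here).
    """
    n_ij = sum(counts[j])      # N_ij: instances with target value j
    r_i = len(counts[j])       # number of values the feature can take
    return (counts[j][k] + 1) / (n_ij + r_i)

def estimate_independent(counts, k):
    """Posterior mean of P(X_i = k), pooling the counts over all target values."""
    n_i_star_k = sum(row[k] for row in counts)   # N_i*k
    n_i = sum(sum(row) for row in counts)        # N_i
    r_i = len(counts[0])
    return (n_i_star_k + 1) / (n_i + r_i)

# Hypothetical counts for one SNP: rows are target values, columns are genotypes.
counts = [[30, 15, 5], [10, 20, 20]]
p_dep = estimate_dependent(counts, j=0, k=0)
p_ind = estimate_independent(counts, k=0)
```

Because every cell receives a pseudo-count, neither estimate is ever exactly 0 or 1, even for a genotype that never occurs with a given target value in the training data.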

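Next, the terms $\mathbb{P}(T\to {X}_{i}|{D}_{i})$ and $\mathbb{P}(T\dots {X}_{i}|{D}_{i})$ in equation (17.6) are derived.

$$\mathbb{P}(T \to X_i \mid D_i) = \frac{\mathbb{P}(D_i \mid T \to X_i)\,\mathbb{P}(T \to X_i)}{\mathbb{P}(D_i \mid T \to X_i)\,\mathbb{P}(T \to X_i) + \mathbb{P}(D_i \mid T \dots X_i)\,\mathbb{P}(T \dots X_i)} \tag{17.9}$$

$$\mathbb{P}(T \dots X_i \mid D_i) = 1 - \mathbb{P}(T \to X_i \mid D_i) \tag{17.10}$$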

The terms on the right side of equation (17.9) are calculated as follows. The term $\mathbb{P}(T\to {X}_{i})$ is the prior probability that *T* and *X*_{i} are probabilistically dependent and $\mathbb{P}(T\dots {X}_{i})=1-\mathbb{P}(T\to {X}_{i})$. The prior probability is also known as the *structure prior*. A uniform structure prior is specified by *q/n* where *n* is the total number of features and *q* is the expected number of features that are predictive of *T*.

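The term $\mathbb{P}({D}_{i}|T\to {X}_{i})$ in equation (17.9) is given by the following equation based on assumptions described in Appendix 17.A:

$$\mathbb{P}(D_i \mid T \to X_i) = \prod_{j=1}^{r_T} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \tag{17.11}$$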

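The details of the derivation of equation (17.11) are provided in Appendix 17.A. In a manner similar to equation (17.11), $\mathbb{P}({D}_{i}|T\dots {X}_{i})$ in equation (17.9) is derived as follows:

$$\mathbb{P}(D_i \mid T \dots X_i) = \frac{(r_i - 1)!}{(N_i + r_i - 1)!} \prod_{k=1}^{r_i} N_{i\ast k}! \tag{17.12}$$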

Now, all the terms in equation (17.6) have been derived, and the model-averaged posterior probability for each value *t* of *T* conditioned on *x* is estimated using equation (17.4).
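The structure-related terms can be sketched as below, assuming the closed-form Bayesian score with uniform Dirichlet priors whose derivation is outlined in Appendix 17.A. The function names and the genotype counts are hypothetical, and computing in log space with `math.lgamma` is an implementation choice made here to keep the factorials of large counts manageable.

```python
import math

def log_factorial(n):
    """log(n!) via the log-gamma function, which stays finite for large counts."""
    return math.lgamma(n + 1)

def log_score(count_rows):
    """Log Bayesian score: one uniform-Dirichlet term per row of counts."""
    total = 0.0
    for row in count_rows:
        r_i = len(row)
        total += log_factorial(r_i - 1) - log_factorial(sum(row) + r_i - 1)
        total += sum(log_factorial(n) for n in row)
    return total

def arc_posterior(counts, structure_prior):
    """P(T -> X_i | D_i); counts[j][k] holds N_ijk for one feature."""
    pooled = [sum(column) for column in zip(*counts)]  # N_i*k for each k
    log_dep = math.log(structure_prior) + log_score(counts)
    log_ind = math.log(1.0 - structure_prior) + log_score([pooled])
    # Subtract the larger log term before exponentiating to avoid underflow.
    top = max(log_dep, log_ind)
    dep, ind = math.exp(log_dep - top), math.exp(log_ind - top)
    return dep / (dep + ind)

def model_averaged_conditional(p_dep, p_ind, arc_prob):
    """Blend the dependent and independent estimates of P(x_i | t)."""
    return arc_prob * p_dep + (1.0 - arc_prob) * p_ind

# Hypothetical genotype counts (rows: target values, columns: genotypes).
associated = [[300, 150, 50], [100, 200, 200]]
unassociated = [[200, 200, 100], [201, 199, 100]]
prior = 20 / 312318  # an example structure prior of the form q/n
w_assoc = arc_posterior(associated, prior)
w_null = arc_posterior(unassociated, prior)
```

With a small structure prior, a feature needs strong evidence of association before its dependent estimate receives appreciable weight; for the unassociated counts the model-averaged probability stays essentially equal to the pooled estimate, so the feature has almost no effect on the prediction.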

# 17.5 Evaluation Protocol

This section briefly describes the Alzheimer’s disease GWAS data set and the evaluation protocol used to assess the performance of the MANB and NB algorithms.

## 17.5.1 Data set

Several GWASs for LOAD have been conducted, and the data set used for evaluating the MANB algorithm is from one such study. The LOAD GWAS data were collected and analyzed originally by [13]. The genotype data were collected on 1411 individuals, of which 861 had LOAD and 550 did not. The target variable that is predicted is the LOAD status of the individual; it is binary and has the values *absent* and *present*. For each individual, the genotype data consist of 502 627 SNPs that were measured on an Affymetrix chip, reduced to 312 316 SNPs after applying quality controls by the original investigators. In addition, two APOE-related SNPs, namely rs429358 and rs7412, were separately genotyped. In total, 312 318 SNPs were used as features.

## 17.5.2 Protocol

The performances of the algorithms were evaluated using fivefold cross-validation. The data set was randomly partitioned into five approximately equal subsets, such that each subset had a similar proportion of individuals who had LOAD. For each algorithm, a model was trained on four subsets and was applied to obtain a LOAD prediction for each instance (individual) in the remaining subset; this process was done once for each of the five subsets. The predictions on the five subsets were then pooled to obtain a LOAD prediction for each of the 1411 individuals in the data set. The performance measures reported are based on those 1411 predictions.
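One way to build such a partition is sketched below. The chapter does not say how the random partition was generated, so the round-robin dealing within each class, the fixed seed, and the function name are assumptions made for illustration; only the class sizes (861 and 550) come from the text.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign instance indices to n_folds folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal shuffled indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Class sizes from the LOAD data set described in the text.
labels = ["present"] * 861 + ["absent"] * 550
folds = stratified_folds(labels)

for held_out in range(len(folds)):
    test_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A model would be trained on train_idx and applied to test_idx here.
```

Each individual appears in exactly one held-out fold, so pooling the five sets of predictions yields one prediction per individual.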

Two performance measures are used: one measures discrimination and the other measures calibration. The area under the Receiver Operating Characteristic (ROC) curve (AUC) was used for measuring discrimination and the Hosmer–Lemeshow goodness-of-fit statistic was used for measuring calibration. Details of these two measures are provided in Appendices 17.B and 17.C.

In addition, each algorithm’s computation times are reported. The fivefold cross-validation process generated five models for each algorithm. For a given algorithm, the training time results reported are the average computation times for the five models learned by the algorithm.

# 17.6 Results

The performance of MANB was compared to that of NB on AUC, calibration, and computation time. For MANB, the parameters were estimated from the data with equations (17.7) and (17.8), and for NB, the parameters were estimated from the data with equation (17.7). For MANB, the structure prior *q/n* was set to 20/312318. The value 20 was chosen subjectively, informed by the number of strongest SNP predictors of LOAD that have been reported in the literature.

Fig. 17.2 shows that MANB has an AUC of about 0.72, while NB has an AUC of about 0.59; the difference between the two AUCs is statistically significant (*p* < 0.00001). NB predicted almost all the test instances as having a posterior probability for LOAD very close to 0 or 1; such extreme predictions tend to occur with NB when there are a large number of features in the model. Fig. 17.3 shows that NB (triangles) is very poorly calibrated, while MANB (circles) is better calibrated than NB.

In terms of computation time, both MANB and NB required only about 16 seconds to train a model (not including about 27 seconds to load the data into main memory) on a computer with a 2.33 GHz Intel processor and 2 GB of RAM. For both algorithms, the time required to predict each test instance was less than 0.1 second.

# 17.7 Conclusion

Developing prediction models from GWAS data presents several challenges, including feature selection in high-dimensional data, learning models with good discrimination and calibration, and computational efficiency. This chapter described the MANB algorithm that is designed to predict outcomes from GWAS data by averaging over a large number of NB models. The results show that, when evaluated on a LOAD GWAS data set, MANB performed significantly better than NB, in terms of both AUC and calibration, and that MANB was computationally as efficient as NB.

# APPENDIX 17.A Derivation of the Bayesian Score

This section describes the derivation of the Bayesian score expressed in equation (17.11) in Subsection 17.4.2 and the assumptions made in deriving it.

Let *X* be a set of *n* discrete-valued features, namely $\{{X}_{1},{X}_{2},\dots ,{X}_{n}\}$. Let feature *X*_{i} have *r*_{i} possible values, and let the target *T* have *r*_{T} possible values. Let the training data set *D* where $D=X\cup T$ contain *m* instances, and let *D*_{i} denote the part of the data set *D* that contains the values for just *X*_{i} and *T*. Let *N*_{ijk} denote the number of times in *D*_{i} that feature *X*_{i} has the value *k* when target *T* has the value *j*. Let ${N}_{ij}=\sum _{k=1}^{{r}_{i}}{N}_{\mathit{\text{ijk}}}$ and ${N}_{i}=\sum _{j=1}^{{r}_{T}}{N}_{ij}$.

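Equation (17.11) in Subsection 17.4.2 is rewritten as:

$$\mathbb{P}(D_i \mid S) = \prod_{j=1}^{r_T} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \tag{17.13}$$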

where *S* denotes the Bayesian network structure $T\to {X}_{i}$; the arc indicates that *T* and *X*_{i} are probabilistically dependent.

The following assumptions are made in deriving equation (17.11) [4, 5]:

1. Multinomial variables. The features in *X* and the target *T* are multinomial and hence discrete valued.
2. Independent and identically distributed data. Instances in the training data set *D* are independent and identically distributed.
3. Complete data. The training data set *D* is complete; that is, each feature and the target has a value in each instance in *D*.
4. Parameter independence. The conditional probability distribution $\mathbb{P}({X}_{i}|t,D)$ for the feature *X*_{i} given a target value *t* is independent of the conditional probability distribution $\mathbb{P}({X}_{i}|{t}^{\prime},D)$ for the same feature *X*_{i} given any other target value $t^{\prime}$ and of the conditional probability distribution $\mathbb{P}({X}_{j}|t,D)$ for any other feature *X*_{j}.
5. Dirichlet priors. The prior distributions for the conditional probabilities $\mathbb{P}({x}_{i}|t,D)$ are specified by Dirichlet distributions. Moreover, the Dirichlet prior distributions are assumed to be uniform before observing *D*.
6. Structure modularity. The prior probability of an arc being present from target *T* to the feature *X*_{i} is independent of the prior probability of an arc from *T* to any other feature *X*_{j}.

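The following equation is obtained by applying assumption 1:

$$\mathbb{P}(D_i, S) = \int_{V} \mathbb{P}(D_i \mid S, V)\, f(V \mid S)\, \mathbb{P}(S)\, dV \tag{17.14}$$

where *V* is a vector whose values denote the conditional probability values for the BN structure *S*, and *f* is the conditional probability density function over *V* given *S*. The integral is over all possible values that *V* takes.

Since $\mathbb{P}(S)$ is a constant, it can be moved outside the integral to obtain

$$\mathbb{P}(D_i, S) = \mathbb{P}(S) \int_{V} \mathbb{P}(D_i \mid S, V)\, f(V \mid S)\, dV \tag{17.15}$$

From equation (17.15), $\mathbb{P}({D}_{i}|S)$ is obtained as follows:

$$\mathbb{P}(D_i \mid S) = \frac{\mathbb{P}(D_i, S)}{\mathbb{P}(S)} = \int_{V} \mathbb{P}(D_i \mid S, V)\, f(V \mid S)\, dV \tag{17.16}$$

From the independence of instances expressed in assumption 2, it follows that

$$\mathbb{P}(D_i \mid S) = \int_{V} \left[\prod_{h=1}^{m} \mathbb{P}(C_h \mid S, V)\right] f(V \mid S)\, dV \tag{17.17}$$

where *C*_{h} is the *h*th instance in *D*_{i} and *m* is the number of instances in *D*_{i}.

Let *x*_{ih} denote the value of *X*_{i}, and *t*_{h} denote the value of *T* in instance *h*. Applying assumption 3, which states that the instances in *D* have no missing values, equation (17.17) is rewritten as

$$\mathbb{P}(D_i \mid S) = \int_{V} \left[\prod_{h=1}^{m} \mathbb{P}(X_i = x_{ih} \mid T = t_h, V)\right] f(V \mid S)\, dV \tag{17.18}$$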

Let θ_{ijk} denote the conditional probability $\mathbb{P}({X}_{i}=k|T=j,V)$. An assignment of numerical values to θ_{ijk} for $k=1$ to *r*_{i} is a probability distribution that will be represented by the list $({\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}})$. In addition, for a given *j* let $f({\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}})$ denote the probability density function over $({\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}})$. The density function $f({\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}})$ is called a second-order probability distribution because it is a probability distribution over a probability distribution. Assumption 4 can be expressed as

$$f(V \mid S) = \prod_{j=1}^{r_T} f(\theta_{ij1}, \dots, \theta_{ij r_i}) \tag{17.19}$$

which states that the values of a second-order probability distribution $f({\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}})$ are independent of the values of any other second-order probability distribution $f({{\mathrm{\theta}}_{ij1}}^{\prime},\dots ,{{\mathrm{\theta}}_{ij{r}_{i}}}^{\prime})$.

Substituting θ_{ijk} for $\mathbb{P}({X}_{i}=k|T=j,V)$ in equation (17.18), and substituting equation (17.19) into equation (17.18) gives the following:

$$\mathbb{P}(D_i \mid S) = \int \cdots \int \left[\prod_{j=1}^{r_T} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}\right] \left[\prod_{j=1}^{r_T} f(\theta_{ij1}, \dots, \theta_{ij r_i})\right] d\theta_{i11} \cdots d\theta_{i r_T r_i} \tag{17.20}$$

where the integral is taken over all θ_{ijk} for *j* = 1 to *r*_{T} and *k* = 1 to *r*_{i}, and for every *i* and *j* the following condition holds: $\sum _{k}{\mathrm{\theta}}_{\mathit{\text{ijk}}}=1$. Since the terms in equation (17.20) are independent, the integral of products is converted to a product of integrals to obtain the following equation:

$$\mathbb{P}(D_i \mid S) = \prod_{j=1}^{r_T} \int \cdots \int \left[\prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}\right] f(\theta_{ij1}, \dots, \theta_{ij r_i})\, d\theta_{ij1} \cdots d\theta_{ij r_i} \tag{17.21}$$

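Assumption 5 states that before observing *D*, all possible values for the conditional probabilities ${\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}}$ are equally likely. It therefore follows that $f({\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}})={C}_{ij}$ for some constant *C*_{ij}. Since $f({\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}})$ is a probability density function, it follows that

$$\int \cdots \int C_{ij}\, d\theta_{ij1} \cdots d\theta_{ij r_i} = 1 \tag{17.22}$$

Solving equation (17.22) for *C*_{ij} yields ${C}_{ij}=({r}_{i}-1)!$, because the set of probability vectors over $r_i$ values (subject to $\sum_k \theta_{ijk} = 1$) has volume $1/(r_i - 1)!$. This implies that $f({\mathrm{\theta}}_{ij1},\dots ,{\mathrm{\theta}}_{ij{r}_{i}})=({r}_{i}-1)!$. Substituting this result in equation (17.21) gives

$$\mathbb{P}(D_i \mid S) = \prod_{j=1}^{r_T} \int \cdots \int \left[\prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}\right] (r_i - 1)!\; d\theta_{ij1} \cdots d\theta_{ij r_i} \tag{17.23}$$

Since $({r}_{i}-1)!$ is a constant in equation (17.23), it can be moved outside the integral to obtain

$$\mathbb{P}(D_i \mid S) = \prod_{j=1}^{r_T} (r_i - 1)! \int \cdots \int \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}\, d\theta_{ij1} \cdots d\theta_{ij r_i} \tag{17.24}$$

The multiple integral in equation (17.24) has the following solution [16]:

$$\int \cdots \int \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}\, d\theta_{ij1} \cdots d\theta_{ij r_i} = \frac{\prod_{k=1}^{r_i} N_{ijk}!}{(N_{ij} + r_i - 1)!} \tag{17.25}$$

Substituting equation (17.25) into equation (17.24) gives the following result and completes the derivation:

$$\mathbb{P}(D_i \mid S) = \prod_{j=1}^{r_T} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \tag{17.26}$$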

# APPENDIX 17.B ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve measures how well a model’s predictions (e.g., posterior probabilities) discriminate between the two values of a binary target variable. The ROC curve is a plot of sensitivity (also known as the *true positive rate*) versus 1-specificity (also known as the *false positive rate*) for different cut-points. A cut-point specifies a value; if the predicted value for an instance is above the cut-point, then it is assigned to one target value, and if it is below the cut-point, then it is assigned to the other target value. Each point on the ROC curve represents a pair of sensitivity and specificity values that corresponds to a particular cut-point. Given two values, namely, 0 and 1, that the target variable can take, sensitivity is defined as the probability of predicting correctly an instance that has target value 1, and specificity is defined as the probability of predicting correctly an instance that has target value 0. The area under the ROC curve (AUC) is typically used as a summary statistic of discrimination. The AUC is equivalent to the probability that a randomly chosen instance from instances with target value 0 will have a smaller predicted probability for value 1 than a randomly chosen instance from instances with target value 1. Higher values of AUC indicate better discrimination. A detailed description of the ROC curve and the AUC is provided in [14].
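The probabilistic interpretation of the AUC leads to a direct, if not the fastest, way of computing it: compare every pair formed by one instance with target value 1 and one instance with target value 0. The sketch below does exactly that; counting ties as one half is a common convention assumed here, and the function name and example scores are hypothetical.

```python
def auc(scores, targets):
    """Fraction of (target-1, target-0) pairs ranked correctly; ties count one half."""
    positives = [s for s, t in zip(scores, targets) if t == 1]
    negatives = [s for s, t in zip(scores, targets) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one instance of each target value")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities for target value 1 and the true target values.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In the example, three of the four pairs are ordered correctly, giving an AUC of 0.75. A model that assigns every instance the same score obtains 0.5, the value expected from random guessing.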

# Appendix 17.C Calibration

Calibration measures how good a model’s predictions (e.g., posterior probabilities) are over a wide range of predictions for a binary target variable. Given two values, namely, 0 and 1, that the target variable can take, calibration assesses the goodness of predicted probabilities for the target value 0 (or 1). A model is well calibrated if the predicted probability for target value 0 for an instance corresponds closely to the observed proportion of instances with target value 0 in a set of similar instances. If, for example, a calibrated prediction model predicts that $\mathbb{P}(0|x)=0.7$, then for a set of instances with values *x* for the features in *X*, the observed proportion of instances with target value 0 is approximately 0.7 (i.e., the target takes the value 0 about 70% of the time).

There are several measures available for assessing calibration, with the Hosmer–Lemeshow goodness-of-fit statistic being a popular one [9]. The Hosmer–Lemeshow goodness-of-fit statistic compares the average predicted probabilities to the observed proportions in ten predefined intervals of probabilities. In order to calculate the value of the statistic, the instances are sorted in ascending order of their corresponding predicted probabilities. The instances are then categorized into ten groups using equal probability intervals, namely, $0-0.1,0.1-0.2,\dots ,0.9-1.0$. For each group, the observed proportion is the number of instances with target value 0 (or 1) divided by the total number of instances in the group, and the average predicted probability is the average of the predicted probabilities for target value 0 (or 1) for the same instances. The ten pairs of values are then compared using the chi-square test. For well-calibrated predictions, the observed proportions will be close to the average predicted probabilities. This will result in a small chi-square value and a non-significant *p*-value (*p* > 0.05). In a calibration plot, each point represents a pair of observed proportion and average predicted probability values. For well-calibrated predictions, the plotted points are close to the 45-degree line (shown as a dotted line in Fig. 17.3). A detailed description of the calibration plot is provided in [14].
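
The grouping procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the chapter's implementation. The function name `hosmer_lemeshow`, the choice of target value 1, the skipping of empty intervals, and the use of (number of contributing groups − 2) degrees of freedom are assumptions. The function returns the chi-square statistic, the degrees of freedom, and the (average predicted probability, observed proportion) pairs that would be plotted in a calibration plot. The *p*-value would then be read from a chi-square distribution with the returned degrees of freedom.

```python
from typing import List, Sequence, Tuple


def hosmer_lemeshow(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> Tuple[float, int, List[Tuple[float, float]]]:
    """Hosmer-Lemeshow statistic with equal-width probability intervals.

    probs  -- predicted probabilities for target value 1 (assumed choice)
    labels -- observed target values (0 or 1)
    """
    groups: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last interval
        groups[idx].append((p, y))

    statistic = 0.0
    contributing = 0
    points: List[Tuple[float, float]] = []
    for group in groups:
        if not group:
            continue  # assumption: empty intervals are skipped
        n = len(group)
        expected = sum(p for p, _ in group)  # expected number of instances with value 1
        observed = sum(y for _, y in group)  # observed number of instances with value 1
        points.append((expected / n, observed / n))
        if 0.0 < expected < n:  # avoid division by zero for degenerate groups
            statistic += (observed - expected) ** 2 / (expected * (1.0 - expected / n))
            contributing += 1

    dof = max(contributing - 2, 0)  # conventional choice: groups minus two
    return statistic, dof, points


# Hypothetical predictions falling into two of the ten intervals.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
stat, dof, points = hosmer_lemeshow(probs, labels)
print(stat, dof, points)
```

Points that lie close to the diagonal in `points` (observed proportion approximately equal to average predicted probability) correspond to a small value of the statistic.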

# References

[1] L. Bertram, C.M. Lill, and R.E. Tanzi. The genetics of Alzheimer disease: Back to the future. *Neuron*, 68(2):270–281, 2010.

[2] L. Bertram, M.B. McQueen, K. Mullin, D. Blacker, and R.E. Tanzi. Systematic meta-analyses of Alzheimer disease genetic association studies: The AlzGene database. *Nature Genetics*, 39(1):17–23, 2007.

[3] W. Buntine. Theory refinement on Bayesian networks. In B. D’Ambrosio and P. Smets, editors, *Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence (UAI 91)*, pages 52–60. Morgan Kaufmann Publishers, 1991.

[4] G.F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. *Machine Learning*, 9(4):309–347, 1992.

[5] D. Dash and G.F. Cooper. Exact model averaging with naive Bayesian classifiers. In *Nineteenth International Conference on Machine Learning*, pages 91–98. Morgan Kaufmann Publishers, 2002.

[6] S. Dreiseitl and L. Ohno-Machado. Logistic regression and artificial neural network classification models: a methodology review. *Journal of Biomedical Informatics*, 35:352–359, 2002.

[7] M. Goedert and M.G. Spillantini. A century of Alzheimer’s disease. *Science*, 314(5800):777–781, 2006.

[8] J.A. Hoeting, D. Madigan, A.E. Raftery, and C.T. Volinsky. Bayesian model averaging: a tutorial. *Statistical Science*, 14(4):382–417, 1999.

[9] D.W. Hosmer and S. Lemeshow. *Applied Logistic Regression*. Wiley, 2nd edition, 2000.

[10] D. Koller and N. Friedman. Bayesian model averaging. In *Probabilistic Graphical Models: Principles and Techniques*. MIT Press, 2009.

[11] D. Madigan and A.E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam’s window. *Journal of the American Statistical Association*, 89:1535–1546, 1994.

[12] A.E. Raftery, D. Madigan, and C.T. Volinsky. Accounting for model uncertainty in survival analysis improves predictive performance. In J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, editors, *Bayesian Statistics 5*, pages 323–349. Oxford University Press, 1995.

[13] E.M. Reiman, J.A. Webster, A.J. Myers, J. Hardy, T. Dunckley, V.L. Zismann, K.D. Joshipura, J.V. Pearson, D. Hu-Lince, M.J. Huentelman, D.W. Craig, K.D. Coon, W.S. Liang, R.H. Herbert, T. Beach, K.C. Rohrer, A.S. Zhao, D. Leung, L. Bryden, L. Marlowe, M. Kaleem, D. Mastroeni, A. Grover, C.B. Heward, R. Ravid, J. Rogers, M.L. Hutton, S. Melquist, R.C. Petersen, G.E. Alexander, R.J. Caselli, W. Kukull, A. Papassotiropoulos, and D.A. Stephan. GAB2 alleles modify Alzheimer’s risk in APOE epsilon4 carriers. *Neuron*, 54(5):713–720, 2007.

[14] M. Vuk and T. Curk. ROC curve, lift chart and calibration plot. *Metodološki zvezki*, 3(1):89–108, 2006.

[15] W. Wei, S. Visweswaran, and G.F. Cooper. The application of naive Bayes model averaging to predict Alzheimer’s disease from genome-wide data. *Journal of the American Medical Informatics Association*, 18(4):370–375, 2011.

[16] S.S. Wilks. *Mathematical Statistics*. Wiley, 1962.

[17] K.Y. Yeung, R.E. Bumgarner, and A.E. Raftery. Bayesian model averaging: Development of an improved multi-class, gene selection and classification tool for microarray data. *Bioinformatics*, 21:2394–2402, 2005.