## Christine Sinoquet and Raphaël Mourad

Print publication date: 2014

Print ISBN-13: 9780198709022

Published to Oxford Scholarship Online: December 2014

DOI: 10.1093/acprof:oso/9780198709022.001.0001


# Bayesian, Systems-based, Multilevel Analysis of Associations for Complex Phenotypes: from Interpretation to Decision

Chapter:
(p.318) Chapter 13 Bayesian, Systems-based, Multilevel Analysis of Associations for Complex Phenotypes: from Interpretation to Decision
Source:
Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics
Publisher:
Oxford University Press
DOI:10.1093/acprof:oso/9780198709022.003.0013

# Abstract and Keywords

The relative scarcity of the results reported by genetic association studies (GASs) prompted many research directions. Despite the centrality of the concept of association in GASs, refined concepts of association are missing; meanwhile, various feature subset selection methods became de facto standards for defining multivariate relevance. On the other hand, probabilistic graphical models, including Bayesian networks (BNs), are more and more popular, as they can learn nontransitive, multivariate, nonlinear relations between complex phenotypic descriptors and heterogeneous explanatory variables. To integrate the advantages of Bayesian statistics and BNs, the Bayesian network based Bayesian multilevel analysis of relevance (BN-BMLA) was proposed. This approach allows the processing of multiple target variables, while ensuring scalability and providing a multilevel view of the results of multivariate analysis. This chapter discusses the use of Bayesian BN-based analysis of relevance in exploratory data analysis, optimal decision and study design, and knowledge fusion, in the context of GASs.

Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics. First Edition. Christine Sinoquet & Raphaël Mourad (Eds). © Oxford University Press 2014. Published in 2014 by Oxford University Press.

The relative scarcity of the results reported by genetic association studies (GASs) has prompted many research directions, such as the use of univariate Bayesian analysis and the use of multivariate, complex, or integrated models. Despite the centrality of the concept of association in GASs, refined concepts of association are missing; meanwhile, various feature subset selection methods became de facto standards for defining multivariate relevance. On the other hand, probabilistic graphical models, including Bayesian networks (BNs), are more and more popular, as they can learn non-transitive, multivariate, non-linear relations between complex phenotypic descriptors and heterogeneous explanatory variables. To integrate the advantages of Bayesian statistics and BNs, the BN-based Bayesian multilevel analysis of relevance (BN-BMLA) was proposed. This approach allows the processing of multiple target variables, while ensuring scalability and providing a multilevel view of the results of multivariate analysis. This chapter (p.319) discusses the use of Bayesian BN-based analysis of relevance in exploratory data analysis, optimal decision and study design, and knowledge fusion, in the context of GASs. First, various BN-based concepts of association and relevance are overviewed. In particular, the chapter analyzes the connections between BNs and strong/weak relevance and Markov blankets/boundaries or related sets; relevance relations are defined. Then the advantages of the Bayesian statistical approach for sufficiently characterizing and exploring weakly significant results are shown. For this purpose, the focus is set on the posteriors that are defined over the abovementioned relevance relations. The next section discusses Bayes optimal decisions on multivariate relevance in GAS results. In the last section, it is shown that the Bayesian BN-based approach provides a framework for the fusion of the results obtained through various genetic data analyses. This last section describes a procedure dedicated to estimating the posteriors of complex features, such as those involved in the hierarchical, interrelated hypotheses of the BN-BMLA framework.

# 13.1 Introduction

The relative scarcity of results reported by genetic association studies has led to several approaches, such as univariate Bayesian analysis [5, 52] and the use of multivariate, complex or integrated models [61, 65]. Probabilistic graphical models, including Bayesian networks (BNs), are more and more popular as they can learn non-transitive, multivariate, non-linear relations between complex phenotypic descriptors (also known as target variables, dependent variables, or dependent outcomes) and heterogeneous, mostly genetic, explanatory variables (input variables, factors or predictors, also known as attributes or features in statistics). We discuss the limitations of conventional statistical association here and show how they can be circumvented using the notion of the so-called relevance of input variables to one or more target variables. In Subsection 13.2.1, we will show that relevance is a useful extension of association. Recently we proposed the Bayesian network-based Bayesian multilevel analysis of relevance (BN-BMLA), which integrates the advantages of Bayesian statistics and BNs [4]. Moreover, this approach allows investigation of multiple target variables and provides scalable intermediate levels between univariate strong relevance and full multivariate relevance to interpret the results at partial multivariate levels. Thus, BN-BMLA provides a multilevel view on multivariate analysis. We discuss the use of the BN-BMLA relevance analysis in data exploration, optimal decision, and knowledge fusion.

First, we overview some of the structural properties of BNs, with special emphasis on systems-based analysis of relevance in Section 13.2. Section 13.3 shows the advantages of the Bayesian statistical approach in the characterization and exploration of weakly significant results. In Section 13.4, we discuss the application of Bayesian decision theory to GAS. In Section 13.5, we deal with the Bayesian interpretation and fusion of results from data analysis. In this chapter, we also consider the practical and computational aspects of the Bayesian inference: we apply the methods described in the domain of asthma.

Input variables in the current application typically correspond to single nucleotide polymorphisms (SNPs), whereas target variables represent asthmatic and possibly other phenotypic aspects. The notation is as follows: input variables (typically SNPs) are denoted by $\mathbf{X}$ ($\mathbf{X} = \{X_1,\dots,X_n\}$), target variables (typically asthma and possibly other phenotypic variables) are denoted by $\mathbf{Y}$ ($\mathbf{Y} = \{Y_1,\dots,Y_m\}$), and $\mathbf{V}$ ($\mathbf{V} = \mathbf{X} \cup \mathbf{Y}$) denotes all the variables. The data set with $N$ samples is denoted $D_N$.

# (p.320) 13.2 Bayesian network-based Concepts of Association and Relevance

Refined concepts of association are missing in genetic association studies despite the centrality of the concept. Therefore, various feature subset selection (FSS) methods became de facto standards for quantifying the joint relevance of multiple variables and their interactions, which will be referred to as multivariate relevance (for an overview of FSS the reader is referred to [48]). In this section, we discuss the use of Bayesian network structural properties to define such refined concepts.

## 13.2.1 Association and Strong Relevance

The extension of standard statistical pairwise association (univariate case) toward many-to-one or many-to-many relations (multivariate case) is a challenging task, because several goals can be formulated, such as assessing the predictive performance and interdependence of multiple predictors.

In the predictive approach to the identification of relevant variables, relevance to the target variable Y is defined in the wrapper framework below.1 The inherent limitation of a wrapper approach is that it is biased by the class of predictive model used, by the optimization algorithm, by the data set, and by the loss function2 quantifying false and missed discovery [32]. A typical example setting in case-control studies is the use of logistic regression fitted by a gradient descent method to minimize misclassification error, optionally with complexity regularization to minimize overfitting for a given data set. The standard conditional probabilistic version of relevance types is free of model class, optimization, data set, or loss function, and is defined as follows:

Definition 13.1

A feature (input variable) $X_i$ is strongly relevant to $Y$ if there exist $X_i = x_i$, $Y = y$, and $s_i = x_1,\dots,x_{i-1},x_{i+1},\dots,x_n$, with $S_i = \{X_1,\dots,X_{i-1},X_{i+1},\dots,X_n\}$, such that $P(x_i, s_i) > 0$ and $P(y \mid x_i, s_i) \neq P(y \mid s_i)$. A feature $X_i$ is weakly relevant if it is not strongly relevant and there exists a subset of features $S'_i$ of $S_i$ for which there exist some $x_i$, $y$, and $s'_i$ such that $P(x_i, s'_i) > 0$ and $P(y \mid x_i, s'_i) \neq P(y \mid s'_i)$. A feature is relevant if it is either weakly or strongly relevant; otherwise it is irrelevant [32].

Strong relevance formalizes the necessary and sufficient set of predictors, whereas weak relevance does not require necessity. For example, direct associations correspond to strong relevance, whereas indirect associations, mediated by other variables in linkage, correspond to weak relevance (for direct, indirect, and confounded associations, see [15]).
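As a toy illustration of Definition 13.1, the following sketch exhaustively checks strong relevance and pairwise association on a small, hypothetical XOR-like ("purely epistatic") distribution; all names and probabilities are illustrative assumptions, not data from the chapter.

```python
import itertools

# Hypothetical joint P(X1, X2, Y): X1, X2 are fair independent coins and
# P(Y=1 | x1, x2) = 0.9 if x1 != x2 else 0.1 (an XOR-like effect).
P = {}
for x1, x2, y in itertools.product([0, 1], repeat=3):
    py1 = 0.9 if x1 != x2 else 0.1
    P[(x1, x2, y)] = 0.25 * (py1 if y == 1 else 1 - py1)

def cond(p, target_pos, target_val, given):
    """P(variable at target_pos = target_val | given), with given = {pos: val}."""
    num = sum(pr for a, pr in p.items()
              if a[target_pos] == target_val
              and all(a[i] == v for i, v in given.items()))
    den = sum(pr for a, pr in p.items()
              if all(a[i] == v for i, v in given.items()))
    return num / den

# Strong relevance of X1 (position 0): condition on ALL remaining inputs (X2)
# and look for any instantiation where X1 changes the conditional of Y.
strongly_relevant = any(
    abs(cond(P, 2, 1, {0: x1, 1: x2}) - cond(P, 2, 1, {1: x2})) > 1e-12
    for x1 in (0, 1) for x2 in (0, 1)
)

# Pairwise association of X1 with Y: compare P(Y=1 | x1) with P(Y=1).
associated = any(
    abs(cond(P, 2, 1, {0: x1}) - cond(P, 2, 1, {})) > 1e-12
    for x1 in (0, 1)
)

print(strongly_relevant, associated)  # True False
```

Here $X_1$ is strongly relevant yet exhibits no pairwise association, which is exactly the purely epistatic case that pairwise filtering methods miss.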

Besides the predictive aspects, relevance can also be defined using BNs to represent complex dependence patterns between phenotypic descriptors and multiple predictors, and to express causal relationships between variables. In general, a BN represents the joint probability distribution $P(V)$ with two main components. The first main component is a directed acyclic graph (DAG) G, whose nodes are random variables from the set V. We often refer to G as the structure of the BN. A directed edge pointing from a so-called parent variable Xp to its child variable Xc implies not only that Xc depends on Xp but also that Xp is a direct cause of Xc in the sense that (p.321) the BN does not contain any other variable Xi that mediates the effect of Xp on Xc. The strength of such dependence of Xc on Xp is given by the extent to which the probability distribution of Xc shifts in response to a certain change in the value of Xp. This gives rise to a separate set of parameters defining the conditional distribution for each variable, given each possible value of its parent variables; these parameter sets altogether form θ‎, the second main component of the BN.

The BN representation is unique in offering three integrated levels to formulate these aspects [42]:

1. Predictive performance: Multivariate relevance can be based on the concepts of predictivity, cardinality, and redundancy of the predictors.

2. Global patterns of independences: The global characterization of relevance relations based on the overall independence map3 of the application domain is of fundamental importance to clarify direct, indirect, and confounded associations [15].

3. Causal aspects: A causal model4 also helps to dissect and estimate the effect size of genetic mechanisms related to complex phenotypes described by multiple variables [42].

The independence map-based approach provides the definition of joint relevance of multiple input variables (therefore, this approach is more naturally suited to multivariate analysis) [41]:

Definition 13.2

A set of variables $\mathbf{X}' \subseteq \mathbf{V}$ is called a Markov blanket set (MBS) of $Y$ with respect to the distribution $P(\mathbf{V})$ if $(Y \perp\!\!\!\perp \mathbf{V} \setminus \mathbf{X}' \mid \mathbf{X}')_P$, where $\perp\!\!\!\perp$ denotes conditional independence. A minimal Markov blanket is called a Markov boundary.

The probabilistic definitions of Markov blanket and Markov boundary became fundamental concepts in the modern conceptualization of multivariate associations and provide foundations and motivations for many feature subset selection methods [1, 34, 56]. It follows from these basic characteristics of BNs that they offer a graphical representation of relevance derived both from the predictive approach (cf. strong and weak relevance) and the independence map-based one (cf. Markov blankets and boundaries). Thus, BNs provide a unifying framework for both approaches. Additionally, this common framework allows the extension of the concept of relevance relations with causality. The connections between BNs, strong/weak relevance, and Markov blankets/boundaries are provided by the following theorems:

Theorem 13.1

For a distribution $P(V)$ defined by the BN $(G,θ)$ the variables bd(Y,G) form a (not necessarily unique or minimal) Markov blanket of Y, where bd(Y,G) denotes the set of parents, children, and the children’s other parents for Y [41].

Theorem 13.2

If $P(V)$ is a stable distribution5 and DAG G is Markov compatible with $P(V)$,6 then bd(Y,G) is the unique, minimal Markov blanket (called a Markov boundary). In stable distributions, bd(Y,G) also identifies the strongly relevant variables [56].

(p.322) A BN structure explicitly represents both aspects of the Markov boundary: on the one hand, bd(Y,G) is necessary because it contains only strongly relevant variables; on the other hand, bd(Y,G) is sufficient because none of the remaining variables in V (i.e., the remaining nodes in G) is strongly relevant.
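The boundary bd(Y,G) of Theorem 13.1 is mechanical to compute from a DAG; a minimal sketch, with a hypothetical edge list (not the chapter's asthma network), collecting parents, children, and the children's other parents:

```python
def bd(target, edges):
    """bd(Y, G): parents, children, and the children's other parents of Y."""
    parents = {u for u, v in edges if v == target}
    children = {v for u, v in edges if u == target}
    spouses = {u for u, v in edges if v in children and u != target}
    return parents | children | spouses

# X1 -> Y <- X2,  Y -> Z <- X3,  X4 -> X1  (X4 is only a remote ancestor)
edges = [("X1", "Y"), ("X2", "Y"), ("Y", "Z"), ("X3", "Z"), ("X4", "X1")]
print(sorted(bd("Y", edges)))  # ['X1', 'X2', 'X3', 'Z']
```

Note that the ancestor X4 is excluded: under the conditions of Theorem 13.2 it is only weakly relevant to Y.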

Definition 13.3

The induced pairwise relation $MBM(Y,Xi,G)$ with respect to G between Y and Xi is called Markov blanket membership (MBM):

(13.1)
$$\mathrm{MBM}(Y, X_i, G) = 1\big(X_i \in \mathrm{bd}(Y, G)\big)$$

As Theorem 13.2 shows, the boundary graph bd(Y,G) exactly represents the (probabilistically defined) Markov boundary, and the MBMs $MBM(Y,Xi,G)$ exactly represent the (probabilistically defined) strongly relevant variables Xi for a stable distribution $P(V)$ and a Markov compatible DAG G. The following Subsection 13.2.2 describes the mathematical background of the soundness of representing relevance by bd(Y,G). Readers interested only in the application of the concepts can continue at Subsection 13.2.3 (page 323).

## 13.2.2 Stable Distributions, Markov Blankets and Markov Boundaries

The limitation of the assumption about the stability of the distribution is a subtle issue. The main reason lies in the independences, which are only parametrically and not structurally defined [41, 42]. Whether our hypotheses are the BN structures (DAGs G) or the observational equivalence classes of BNs7 ($G∼$), for a given G or $G∼$ the structurally implied conditional independences through d-separation8 hold in any Markov compatible distribution $P(V)$ [42]. This means that bd(Y,G) represents a Markov blanket in any Markov compatible distribution and $MBM(Y,X_i,G)$ represents relevant variables. Note that the graphical definition of MBM satisfies the probabilistic definition of Markov blanket in Definition 13.2 according to Theorem 13.1, assuming a distribution Markov compatible with G, because bd(Y,G) is always a Markov blanket. Furthermore, the definition of relevance based on bd(Y,G) avoids the problem of the multiplicity of Markov blankets and defines mutually exclusive and exhaustive events (a sample space), which will be necessary to define a probability distribution over the sets of strongly relevant variables (see Section 13.3, page 323).

However, according to Theorem 13.2, in case of a stable distribution, the implied results are sharper: bd(Y,G) represents a Markov boundary and $MBM(Y,Xi,G)$ represents strongly relevant variables. Whereas only Markov blanket and relevance can be used generally, the non-stable distributions are rare in a technical sense. This property formally appears in the Bayesian approach we are using in this chapter as follows. Using a continuous parameter distribution $P(θ|G)$, the specified distribution $P(V)$ is stable with probability 1 [38]. Thus, in that Bayesian context, bd(Y,G) and $MBM(Y,Xi,G)$ almost surely represent Markov boundaries and strong relevance and therefore we can use the stricter Markov boundary concept, despite the more widespread use of the term of Markov blanket.

For simplicity, we also refer to bd(Y,G) as the Markov boundary set of Y using the notation $MBS(Y,G)$, assuming that the hypothesized distribution(s) is (are) stable with respect to G. Under (p.323) the same assumption, we also refer to $MBM(Y,Xi,G)$ as a strong relevance relation. According to Theorem 13.2, this graph-based definition of the MBM relation is equivalent to the probabilistic definition of the strong relevance relation assuming a Markov compatible stable distribution. Fig. 13.1 shows the possible relevance status of the variables, in cases of stability and non-stability of distributions Markov compatible with G.

Fig. 13.1 Possible relevance status of the variables in distributions Markov compatible with G. The left and right sides indicate the cases of stable and non-stable distributions respectively (in both cases Markov compatible with G). In non-stable distributions, the set of strongly relevant variables and a Markov blanket for Y can be smaller than bd(Y,G), but bd(Y,G) is always a Markov blanket for Y.

## 13.2.3 Further Relevance Types

The concept of multivariate relevance can be further enriched using the common conceptual framework of BNs to represent all the relevant interactions between the strongly relevant variables as follows:

Definition 13.4

A subgraph of BN structure G is called the Markov blanket graph or mechanism boundary graph (MBG) MBG(Y,G) of variable Y if it includes the nodes in the Markov blanket defined by bd(Y,G) and the incoming edges into Y and its children [3].

An MBG represents all the necessary and sufficient interactions with respect to the Markov boundary of a given variable; thus it has a fundamental role both in the prediction of a target variable and in the interpretation of the interactions and mechanisms for a target variable.

Throughout the chapter, we use the term feature in a broad sense to denote a function $F(V′,G)=f$ with arguments Vʹ ($V′⊆V$) and DAG G, e.g., MBG(Y,G)=mbg, MBS(Y,G)=mbs and MBM(Y,Xi,G); that is, we concentrate only on the structural aspects of DAGs for a given subset of variables Vʹ. However, the concept of features over DAGs is also used to denote a quantitative property of a given DAG, such as maximal in-degrees, out-degrees, clique sizes, or length (p.324) of directed paths. Additionally, the term “feature” is also used to denote variables, e.g., in the case of feature subset selection (FSS) problems.

The Markov blanket-based extensions of the relevance concepts express local aspects, i.e., neighboring variables. The advantages of BNs for representing global dependence maps and relevance relations are well known, but their direct application in FSS is hindered by their high computational and sample complexity.9 Now we consider the use of the global representational properties of BNs to characterize relevance relations, e.g., the type of association between non-neighboring, distant variables. The subtypes of relevance are listed in Table 13.1 along with the notation assigned to them.

Table 13.1 Graphical model based definition of relevance types.

| Relation | Notation | Graphical |
| --- | --- | --- |
| Direct causal relevance | DCR(X,Y) | There is an edge between X and Y. |
| Indirect causal relevance | ICR(X,Y) | There is a directed path between X and Y. |
| Causal relevance | CR(X,Y) | DCR(X,Y) or ICR(X,Y). |
| Confounded relevance | ConfR(X,Y) | X and Y have a common ancestor. |
| (Pairwise) Association | A(X,Y) | DCR(X,Y) or ICR(X,Y) or ConfR(X,Y). |
| Interactionist relevance | IR(X,Y) | X and Y have a common child. |
| Strong relevance | SR(X,Y) | IR(X,Y) or DCR(X,Y) (also MBM(X,Y)). |
| Mixed conditional relevance | MCondR(X,Y) | There is an undirected path between X and Y with a node with converging edges. |
| Conditional relevance | CondR(X,Y) | IR(X,Y) or MCondR(X,Y). |
| Weak relevance | WR(X,Y) | MCondR(X,Y) or ICR(X,Y) or ConfR(X,Y). |

For their causal interpretation under the causal Markov assumption, the reader is referred to [26, 42].10 It needs to be emphasized that these relations represent different aspects of relevance and there are subtle differences with respect to their use in genetics, because of the possibility of multiple target variables and the possibility of non-genetic predictors. Following the usual genetic terminology [15], direct causal relevance (DCR) formalizes the concept of direct association, although it covers direct consequences as well. Indirect causal relevance (ICR) and confounded relevance (ConfR) separate and express the concepts of indirect association and confounded association. Pairwise association (A) represents the usual association, which is the union of direct, indirect, and confounded associations. Interactionist relevance (IR) deviates from the purely epistatic relation [16], because this latter is the consequence of vanishing marginal effects of the individual variables, which can be modeled by contextual dependences (see Definition 13.5, page 325). To summarize, the most notable difference between the standard concept of pairwise association and (p.325) strong relevance is that association includes certain forms of weak relevance (i.e., confounded relevance (ConfR) and indirect causal relevance (ICR)) and that it does not include the case represented by interactionist relevance. A direct consequence is that predictors exclusively in interactionist relevance will be filtered out by pairwise methods typical in high-dimensional studies, as they are not associated by definition. Also note that most of these relations are not mutually exclusive, e.g., a predictor may have direct and indirect effects at the same time, which can also be confounded. An overview of these relations is shown in Fig. 13.2. The applications of these relevance types are described in Subsection 13.3.6.
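The graphical criteria of Table 13.1 can be tested directly on a DAG. The following sketch classifies a pair of variables on an illustrative edge list; the graph, the names, and the literal reading of ICR as "some directed path exists" are assumptions of this example.

```python
def parents(v, edges):
    return {u for u, w in edges if w == v}

def ancestors(v, edges):
    """All proper ancestors of v (transitive closure of the parent relation)."""
    seen, stack = set(), [v]
    while stack:
        for p in parents(stack.pop(), edges):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def relevance_types(x, y, edges):
    types = set()
    if (x, y) in edges or (y, x) in edges:
        types.add("DCR")                    # direct causal relevance: an edge
    if x in ancestors(y, edges) or y in ancestors(x, edges):
        types.add("ICR")                    # a directed path between them
    if ancestors(x, edges) & ancestors(y, edges):
        types.add("ConfR")                  # common ancestor: confounding
    if {w for u, w in edges if u == x} & {w for u, w in edges if u == y}:
        types.add("IR")                     # common child: interactionist
    return types

# C -> X, C -> Y (confounder); X -> Y (direct); X -> Z <- Y (common child)
edges = [("C", "X"), ("C", "Y"), ("X", "Y"), ("X", "Z"), ("Y", "Z")]
print(sorted(relevance_types("X", "Y", edges)))  # ['ConfR', 'DCR', 'ICR', 'IR']
```

This also illustrates the non-exclusivity noted above: a single pair can be directly, indirectly, and confoundedly related at once.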

Fig. 13.2 Overview of types of pairwise dependences. Weak relevance is constituted by mixed conditional relevance, indirect causal relevance, and confounded relevance (see Table 13.1 and Fig. 13.1).

Hitherto, the definitions of relevance were based on the general concept of conditional independence, but conditional independence can be made more specific by introducing contextual independence when independence is present only in a given context (for its use in the field of BNs, the reader is referred to e.g., [6]). In genetics, this representation can express a related, seemingly complementary phenomenon if a variation has no effect with respect to a given target except in the presence of other variations [14, 45].

Definition 13.5

A variable $X_i$, which is strongly relevant to $Y$, is also contextually relevant to $Y$ if there exist some $X_i = x_i^1, x_i^2$ and context $s_i = x_1,\dots,x_{i-1},x_{i+1},\dots,x_n$ for which $P(x_i^1, s_i) > 0$ and $P(x_i^2, s_i) > 0$ such that $P(Y \mid x_i^1, s_i) = P(Y \mid x_i^2, s_i)$.

Contextual relevances can be exactly and explicitly represented by special local dependence models in BNs, such as decision trees or decision graphs, which can express dependences at the value level (which is below the variable level) [6]. These models allow the explicit representation of the contexts defined by value configurations $C = c$ ($C \subset V$), in those cases in which $X_i$ is conditionally independent of $Y$. Non-BN-based methods implicitly capable of addressing contextual relevance include multifactor dimensionality reduction (MDR) [39] and logic regression [47].
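A toy conditional distribution in the spirit of Definition 13.5 makes the idea concrete; the numbers are illustrative assumptions. $X_1$ matters for $Y$ only in the context $X_2 = 1$, e.g., a variant with an effect only in the presence of another variant.

```python
def p_y1(x1, x2):
    """Hypothetical P(Y = 1 | x1, x2) with a contextual independence."""
    if x2 == 0:
        return 0.2                     # context X2 = 0: X1 has no effect
    return 0.9 if x1 == 1 else 0.1     # context X2 = 1: X1 shifts P(Y = 1)

# Contextual independence: flipping X1 in context X2 = 0 leaves P(Y=1) fixed,
contextually_independent = p_y1(0, 0) == p_y1(1, 0)
# yet X1 is strongly relevant overall, because it has an effect when X2 = 1.
strongly_relevant = p_y1(0, 1) != p_y1(1, 1)

print(contextually_independent, strongly_relevant)  # True True
```

Such a conditional can be encoded compactly by a decision tree over the parent values, which is exactly the local dependence model mentioned above.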

The identification of a contextually relevant factor Xi is particularly challenging, because it has an effect—by definition—in particular contexts C=c. If its effect strength is moderate in these contexts or if these contexts are very rare, then reliable identification of the relevance of Xi statistically requires prior knowledge and multivariate modeling. Furthermore, it is also possible, in the case of a purely epistatic factor, that a strongly relevant predictor has no main effect [16]. (p.326) A practical consequence is that the statistical identification of a contextually relevant factor is not possible without observing other variables forming its context. In BNs lacking the ability to explicitly represent contextual independence, this phenomenon frequently leads to misinterpretations because of the incorrect representation of a contextually relevant factor Xi by an interactionist relevance relation IR(X,Y) with target variable Y.11 Using omics data, such as genome-wide association studies (GWAS) data or next-generation sequencing (NGS) data, the problem that the other variables involved in the interaction are unobserved is diminished. However, the problem of multiple hypothesis testing arises in the frequentist statistical framework, because of the high number of potential interactions: the reader is directed to Section 13.4 (page 344) for the discussion of multiple hypotheses in the Bayesian approach and to [19, 35, 65] for non-BN-based Bayesian methods to explore interactions.

## 13.2.4 Necessary Subsets and Sufficient Supersets in Strong Relevance

The MBM feature gives an overall characterization of strong relevance for each predictor separately but does not capture the joint relevance of predictors. At the other extreme, Markov boundary subsets characterize the joint strong relevance of predictors, but the number of possible MB sets is exponential, which is intractable computationally and statistically. The concept of k-ary Markov boundary subsets, focusing on k sized sets of variables, was introduced to support a constrained multivariate analysis of strong relevance, called partial multivariate analysis of relevance [4]. Here we complement this sub-relevance concept with an analogous concept of sup-relevance.

Definition 13.6

For a distribution $P(V)$ with Markov boundary set mbs, a set of variables s is called sub-relevant if it is a k-ary Markov boundary subset (k-subMBS), i.e., $|s|=k$ and $s⊆mbs$. A set of variables s is called sup-relevant if it is a k-ary Markov boundary superset (k-supMBS), i.e., $|s|=k$ and $mbs⊆s$.

The k-subMBS and k-supMBS concepts are more flexible than the MBS relation, because they can be used to capture the presence or the absence of relevant variables. A k-subMBS set $s_{\mathrm{sub}}$ contains those variables that are strongly relevant without specifying the status of the other variables in its complement $s_{\mathrm{sub}}^{c}$ (i.e., a k-subMBS denotes a "necessary" set of variables). The complement $s_{\mathrm{sup}}^{c}$ of a k-supMBS set contains all variables that are not strongly relevant, but it does not state anything about the strong relevance of the variables in $s_{\mathrm{sup}}$ (i.e., a k-supMBS denotes a "sufficient" set of variables). Note that the probabilistically defined concept of sub-relevance can be seen as a multivariate extension of strong relevance for variables in Definition 13.1 (page 320). Note also that the concept of sup-relevance is equivalent to the concept of Markov blankets, but for symmetry and uniformity we use the "sub-" and "sup-" terminology. We use the term k-MBS to denote together the k-subMBS and k-supMBS concepts.

(p.327) Furthermore, the k-subMBS and k-supMBS concepts form hierarchically related, overlapping hypotheses, e.g., a k-subMBS hypothesis can be extended to multiple, more complex (k+1)-subMBS hypotheses. In fact, the scalable polynomial cardinality of the set of k-subMBSs and k-supMBSs ($O(n^k)$ for fixed k) bridges the linear cardinality of the features ($O(n)$) and the exponential cardinality of the MBSs ($O(2^n)$), where n denotes the number of observed variables. Because the cardinality of the MBGs and DAGs is even higher [12], we can think of MBMs, k-subMBSs/k-supMBSs, MBSs, MBGs, and DAGs as more and more complex hypotheses about relevance. Indeed, they form a hierarchy of levels of abstraction to support a multilevel analysis, in which intermediate levels of k-MBS for varying k allow a scalable partial multivariate analysis focusing on k number of variables.
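Definition 13.6 can be sketched directly; the Markov boundary below echoes the variable names of the asthma example but is a hypothetical assumption for illustration.

```python
from itertools import combinations

mbs = {"IgE", "Eos", "X3"}   # assumed Markov boundary of the target

def k_subMBS_sets(mbs, k):
    """All k-ary Markov boundary subsets ("necessary" sets)."""
    return [set(s) for s in combinations(sorted(mbs), k)]

def is_k_supMBS(s, mbs):
    """s is a k-supMBS ("sufficient" set) iff it covers the whole boundary."""
    return mbs <= s

print(len(k_subMBS_sets(mbs, 2)))                        # C(3, 2) = 3
print(is_k_supMBS({"IgE", "Eos", "X3", "Asthma"}, mbs))  # True
print(is_k_supMBS({"IgE", "Eos", "Asthma"}, mbs))        # False: X3 missing
```

For n observed variables there are $\binom{n}{k}$ candidate k-MBS hypotheses, polynomial in n for fixed k, versus $2^n$ full MBS hypotheses, which is the scalability argument of the partial multivariate analysis.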

An application of these concepts to the problem domain of asthma is illustrated in Fig. 13.3.

Fig. 13.3 Sub-relevant sets for the target variable Rhinitis in stable distributions which are Markov compatible with the depicted Bayesian network structure G. Sub-relevant sets are denoted by dashed circles. The target variable is highlighted in bold. Sub-relevant sets: 1-subMBS: $X_3$; 2-subMBS: IgE, Eosinophil (Eos.); 3-subMBS: $X_5$, $X_6$, Asthma. The Markov boundary of Rhinitis is denoted by shaded nodes, and any set including these nodes is a sup-relevant set (k-supMBS). Pairwise relevance relations are as follows: interactionist relevance (e.g., Rhinitis and $X_6$ have a common child), direct causal relevance (e.g., $X_1$ and IgE share an edge), confounded relevance (e.g., Rhinitis and Eosinophil level have a common ancestor), and mixed conditional relevance (e.g., $X_7$ and IgE are connected through an undirected path on which Asthma has two converging edges).

## 13.2.5 Relevance for Multiple Targets

If there are multiple target variables $\mathbf{Y}$ which have to be examined together, and if the relations among them are not themselves of interest, one may ask which variables are relevant to the target set. The basic concepts of relevance for a single target variable can easily be extended to multiple targets [4].

Definition 13.7

A variable $X_i$ is strongly (weakly) relevant to $\mathbf{Y}$ if and only if it is strongly (weakly) relevant to at least one $Y_i \in \mathbf{Y}$.

We introduce additional relations, such as exclusive or multiple relevance, in order to better characterize the relevance types of a predictor with respect to multiple target variables. The corresponding definitions are given in Table 13.2, using a Bayesian network representation. We refer the reader to Subsection 13.3.6 (page 340) for the application of these relations in asthma and allergy research.

Table 13.2 Subtypes of multitarget relevance. The targets other than Yʹ are denoted as $\mathbf{Y}^-$ ($\mathbf{Y}^- = \mathbf{Y} \setminus Y'$).

| Direct relation(s) to | Notation | Graphical |
| --- | --- | --- |
| One or more targets | $EdgeToAny(X, \mathbf{Y})$ | There is an edge between X and $\mathbf{Y}$. |
| Exactly one of the targets | $EdgeToExactlyOne(X, \mathbf{Y})$ | There is exactly one edge between X and $\mathbf{Y}$. |
| Two or more targets | $MultipleEdges(X, \mathbf{Y})$ | There is more than one edge between X and $\mathbf{Y}$. |
| One or more other targets | $EdgeToOthers(X, \mathbf{Y}^-, Y')$ | There is at least one edge between X and $\mathbf{Y}^-$, but not between X and Yʹ. |

| MBM relation(s) to | Notation | Graphical |
| --- | --- | --- |
| One or more targets | $MBMToAny(X, \mathbf{Y})$ | There is a $Y_i \in \mathbf{Y}$ with $MBM(X, Y_i)$. |
| Exactly one of the targets | $MBMToExactlyOne(X, \mathbf{Y})$ | There is exactly one $Y_i \in \mathbf{Y}$ with $MBM(X, Y_i)$. |
| Two or more targets | $MultipleMBMs(X, \mathbf{Y})$ | There are at least two $Y_i \in \mathbf{Y}$ with $MBM(X, Y_i)$. |
| One or more other targets | $MBMWithOthers(X, \mathbf{Y}^-, Y')$ | There is a $Y_i \in \mathbf{Y}^-$ with $MBM(X, Y_i)$ and $\neg MBM(X, Y')$. |

# (p.328) 13.3 A Bayesian View of Relevance for Complex Phenotypes

The Bayesian network representation, along with the concept of Markov blanket sets and strong relevance, opened many research directions in feature learning, in the feature subset selection problem and in genetic association studies [48]. The "filter" approaches, later termed "local causal" approaches, emerged from Markov blanket sets (MBSs) and strong relevance [1, 10, 28, 34, 63].12 However, despite the rapid development of methods designed to identify an optimal MBS, the global significance of the optimal MBS in the frequentist framework, as well as the lack of dominant MBSs in the Bayesian framework, are still overlooked.

Bayesian methods are increasingly popular in genetic association studies for their ability to sufficiently characterize and explore weakly significant results and to cope with multiple hypothesis testing (for the general approach see [18, 24]; for the application to GAS see [52]; for methods see [19, 35, 65]).

The Bayesian approach aims at determining the posterior distribution of each of the relevance relations defined in Section 13.2 (page 320) given some data, which is used to update the (p.329) prior probability distribution, i.e., our knowledge on these relations prior to obtaining those data. (Henceforth we will use the shorthand “posterior” both for the posterior probability of the specific value of a feature and for the posterior probability distribution of a feature in general if it is clear from the context.) These posterior distributions, such as the ones of Markov boundary graphs, Markov boundary sets, or Markov boundary memberships, are derived by the propagation of the DAG posterior to these “upper” levels by aggregation. Technically speaking, these posteriors are induced13 by the posterior $P(G|DN)$ over the DAG space given the data set DN, because MBG(Y,G), MBS(Y,G), and $MBM(Y,Xi,G)$ are functions (transformations) interpreted over DAGs (i.e., they map each DAG G to a unique value). Consequently, a (normalized) probability distribution over DAGs will induce a (normalized) distribution over the MBGs, the MBSs, and the MBM for each Xi as follows:

(13.2)
$$P(MBG(Y,G)=mbg \mid D_N) = \sum_{G} 1\bigl(MBG(Y,G)=mbg\bigr)\, P(G \mid D_N)$$
(13.3)
$$P(MBS(Y,G)=mbs \mid D_N) = \sum_{G} 1\bigl(MBS(Y,G)=mbs\bigr)\, P(G \mid D_N)$$
(13.4)
$$P(MBM(Y,X_i,G)=1 \mid D_N) = \sum_{G} 1\bigl(MBM(Y,X_i,G)=1\bigr)\, P(G \mid D_N)$$

where 1(A) denotes the indicator function14 of event A.
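To make the aggregation in equations (13.2)–(13.4) concrete, the following Python sketch induces an MBS posterior from a toy posterior over DAGs. The parent-set encoding of DAGs, the helper names, and the two-DAG example are illustrative assumptions, not part of the chapter's methodology.

```python
def markov_blanket(target, parents):
    """Markov blanket of `target` in a DAG encoded as {node: set_of_parents}:
    its parents, its children, and the children's other parents (spouses)."""
    mb = set(parents[target])
    for node, ps in parents.items():
        if target in ps:          # node is a child of target
            mb.add(node)
            mb |= ps - {target}   # spouses of target via this child
    return frozenset(mb)

def induced_mbs_posterior(target, dag_posterior):
    """Mirror equation (13.3): each DAG G contributes its whole posterior
    mass to the single MBS(Y, G) it maps to."""
    mbs_post = {}
    for parents, prob in dag_posterior:
        mbs = markov_blanket(target, parents)
        mbs_post[mbs] = mbs_post.get(mbs, 0.0) + prob
    return mbs_post

# Toy normalized posterior over two DAGs on {Y, X1, X2}.
dags = [
    ({"Y": {"X1"}, "X1": set(), "X2": set()}, 0.7),
    ({"Y": {"X1", "X2"}, "X1": set(), "X2": set()}, 0.3),
]
post = induced_mbs_posterior("Y", dags)
```

Because the aggregation only reroutes probability mass, the induced distribution is again normalized, as stated in the text.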

Note that $MBS(Y,G)$ and $MBM(Y,X_i,G)$ depend only on $MBG(Y,G)$; thus the functions $MBS(Y,G): G \mapsto mbs$ and $MBM(Y,X_i,G): G \mapsto \{0,1\}$ can be analogously defined over the MBGs as $MBS(Y,MBG): mbg \mapsto mbs$ and $MBM(Y,X_i,MBG): mbg \mapsto \{0,1\}$. Thus, the DAG posterior $P(G \mid D_N)$ is not necessary to define the MBS posterior $P(MBS(Y,G) \mid D_N)$ and the MBM posteriors $P(MBM(Y,X_i,G) \mid D_N)$, because the MBG posterior $P(MBG(Y,G) \mid D_N)$ is sufficient. Analogously to equation (13.2), the MBG posterior induces the MBS posterior and the MBM posterior, because of the functional dependence of the MBSs and of the MBM for each $X_i$ on the MBGs:

(13.5)
$$P(MBS(Y,G)=mbs \mid D_N) = \sum_{mbg} 1\bigl(MBS(Y,mbg)=mbs\bigr)\, P(MBG(Y,G)=mbg \mid D_N)$$
(13.6)
$$P(MBM(Y,X_i,G)=1 \mid D_N) = \sum_{mbg} 1\bigl(MBM(Y,X_i,mbg)=1\bigr)\, P(MBG(Y,G)=mbg \mid D_N)$$

Consequently, we can formulate the central task of the Bayesian analysis of relevance using BNs as the estimation of the posterior distribution of MBGs. Note that the posteriors of k-subMBSs and k-supMBSs can also be induced by the MBG posterior. However, the MBG posterior is not sufficient to induce the posteriors of the relevance types in Tables 13.1 (page 324) and 13.2 (page 328), and therefore these should be estimated individually. Thus, in addition to the estimation of the MBG posterior, the general task also includes the estimation of the expectation of arbitrary binary random (p.330) variables $1(F(V',G)=f)$ over the space of DAGs, such as the indicator functions corresponding to the relevance types in Tables 13.1 and 13.2.15 However, the exact computation of this expectation is generally not feasible due to the superexponential number of possible DAG structures; thus an approximation is applied using $M$ DAGs $G_i$ randomly sampled from the posterior $P(G \mid D_N)$:

(13.7)
$$E\bigl[1(F(V',G)=f) \mid D_N\bigr] = \sum_{G} 1\bigl(F(V',G)=f\bigr)\, P(G \mid D_N) \approx \frac{1}{M} \sum_{i=1}^{M} 1\bigl(F(V',G_i)=f\bigr)$$

There are several classes of methods designed to facilitate the estimation in equation (13.7). The Bayesian inference of structural properties of BNs was proposed in [7, 12]. In [37], Madigan et al. proposed a Markov chain Monte Carlo (MCMC) scheme to approximate such Bayesian inference using an unnormalized posterior $\tilde{p}(G \mid D_N)$ and an auxiliary ordering of the variables. In [22], Friedman et al. reported an MCMC scheme over the space of orderings. In [33], Koivisto et al. reported a method to perform exact full Bayesian inference over modular features.16 Recent applications in GASs are reported in [31, 61]. Another approach using PGMs applied the bootstrap framework to provide uncertainty measures for arbitrary subgraphs [20, 43]. To support Bayesian inference over relevance relations, we reported specialized ordering MCMC methods designed to efficiently estimate the posteriors of structural model properties, particularly of Markov blanket graphs [3, 4].

## 13.3.1 Estimating the Posteriors of Complex Features

As we discussed in the previous subsection, the posterior probability of any structural feature $F(V',G)=f$ can be estimated by MCMC methods if efficient computation exists for an unnormalized posterior $\tilde{p}(G \mid D_N)$ and for the feature value $F(V',G)=f$ (see equation (13.7)). However, the estimation of the posterior distributions of complex features with high cardinality, such as the MBS, MBG, and k-MBS features, poses an additional challenge. Earlier we proposed a specialized ordering MCMC method to estimate MBG and MBS posteriors [3], but DAG MCMC methods are more suitable for general, potentially knowledge-intensive structural features; thus, we developed a two-step DAG-based process to estimate all the relations in the BN-BMLA methodology.

We assume a complete data set $D_N$, discrete variables, multinomial sampling, Dirichlet parameter priors ($P(\Theta \mid G)$), and a uniform structure prior ($P(G)$). We used two types of Bayesian Dirichlet parameter priors: the first is defined by the Cooper–Herskovits hyperparameters, and the second is defined by the Bayesian Dirichlet equivalent uniform (BDeu) hyperparameters with virtual sample size equal to one (for their definition, see [12, 30]; for the effect of various virtual sample sizes, see [50, 57]). These assumptions allow the derivation of an efficiently computable unnormalized posterior over the Bayesian network structures $G$ (i.e., over directed acyclic graphs, DAGs) [12, 30].

Using this unnormalized posterior, we perform a special random walk in the space of DAGs $G$ by applying an MCMC sampling method, which inserts, deletes, and inverts edges [25]. The probability of applying the different DAG operators in the proposal distribution is uniform, the length of the burn-in is $10^6$, and the length of the sample collection ($L$) is $5 \times 10^6$.
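The random walk described above can be sketched as a simple Metropolis sampler over DAGs. Everything below (the DAG encoding, the toy scoring function, the tiny chain lengths) is an illustrative assumption; a production sampler would additionally correct for the asymmetry of the edge proposals via the Hastings ratio.

```python
import math
import random

def is_acyclic(parents):
    """Kahn's algorithm on a graph candidate given as {node: set_of_parents}."""
    indeg = {n: len(ps) for n, ps in parents.items()}
    children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}
    queue = [n for n, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return seen == len(parents)

def propose(parents):
    """Insert, delete, or reverse one edge between a random ordered pair."""
    new = {n: set(ps) for n, ps in parents.items()}
    a, b = random.sample(list(new), 2)
    if a in new[b]:                 # edge a -> b exists: delete or reverse it
        new[b].discard(a)
        if random.random() < 0.5:
            new[a].add(b)
    else:                           # otherwise insert a -> b
        new[b].add(a)
    return new

def dag_mcmc(log_score, nodes, burn_in, length, seed=0):
    """Metropolis walk driven by an unnormalized log-posterior over DAGs."""
    random.seed(seed)
    state = {n: set() for n in nodes}
    samples = []
    for step in range(burn_in + length):
        cand = propose(state)
        if is_acyclic(cand) and \
           math.log(random.random()) < log_score(cand) - log_score(state):
            state = cand
        if step >= burn_in:
            samples.append({n: frozenset(ps) for n, ps in state.items()})
    return samples

# Toy score strongly favouring the single edge X -> Y.
samples = dag_mcmc(lambda g: 2.0 if "X" in g["Y"] else 0.0,
                   ["X", "Y"], burn_in=100, length=500)
edge_freq = sum("X" in g["Y"] for g in samples) / len(samples)
```

The relative frequency `edge_freq` is exactly the Monte Carlo estimator of equation (13.7) applied to the edge-indicator feature.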

(p.331) The MCMC process generates a dependent sequence of $L$ DAGs, $D_L^G$. Using the MCMC simulation, we estimate the posterior of the MBGs for the target variables according to equation (13.7) (see Section 13.3, page 328). In each MCMC step, we determine the boundary graph $bd(Y,G)$ corresponding to the DAG $G$ in this step and update the relative frequency of this boundary graph. (We recall that the presence of the boundary graph $bd(Y,G)$ implies with probability one that the corresponding variables are the Markov boundary; see Subsection 13.2.2.) This update is done analogously for all the pairwise relations in Table 13.1 (see Subsection 13.2.3) and in Table 13.2 (see Subsection 13.2.5); these relations are evaluated for every possible pair of variables at each MCMC step, and hence a counter is maintained for each possible instantiation of the given relation throughout the MCMC sampling. Note that there are no practical restrictions on the selection of target variables, i.e., within the same MCMC simulation we can evaluate multiple target sets simultaneously. The computational complexity of the evaluation of the structural features discussed in this chapter and of the update of their relative frequencies is $O(n^2)$, where $n$ denotes the number of variables. In the case of a complex phenotype with multiple descriptors, we can use the descriptors together as a joint target set, as well as each of the descriptors individually. An occasionally practical inversion is to select a predictor variable as the target, because we can then explore all the phenotypes relevant for this predictor (for the application of this inversion in the frequentist framework, see [34]). Hence the set of evaluated features can be fully fitted to the needs, queries, or preconceptions of the expert performing the analysis.

In a second, “post-hoc”17 phase, we compute various MBS-related marginal posteriors from the MBG posterior estimated in the first phase. The MBS and MBM posteriors are computed exactly from the estimated MBG posterior, analogously to equation (13.5) (see Section 13.3, page 328). The posterior of a given k-subMBS or k-supMBS set can be directly computed from the MBS posterior according to equations (13.9) and (13.10) (see Subsection 13.3.4). To find the highly probable k-subMBS and k-supMBS sets, we apply greedy algorithms, because the cardinalities of these sets grow polynomially ($O(n^k)$). In the case of k-subMBSs, the starting state of the greedy search is the empty set, which can be regarded as a trivial 0-subMBS with probability one. The algorithm then expands the set into the (k+1)-subMBS with maximal posterior. In the case of k-supMBSs, the initial state of the search is the complete set $U$, from which the algorithm iteratively eliminates predictors to get the (k−1)-supMBS with maximal posterior.
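A minimal sketch of the greedy k-subMBS expansion described above, assuming the MBS posterior is available as a mapping from variable sets to probabilities (the toy numbers are invented for illustration):

```python
def sub_relevance(s, mbs_posterior):
    """Sub-relevance posterior of s: total posterior mass of the MBSs that
    contain s, i.e. the probability that every member of s is strongly
    relevant (cf. equation (13.9))."""
    return sum(p for mbs, p in mbs_posterior.items() if s <= mbs)

def greedy_k_submbs(k, variables, mbs_posterior):
    """Start from the empty set (a trivial 0-subMBS with probability one)
    and repeatedly add the variable giving the maximal-posterior
    (k+1)-subMBS, as in the greedy search described in the text."""
    s = frozenset()
    for _ in range(k):
        s = max((s | {x} for x in variables if x not in s),
                key=lambda cand: sub_relevance(cand, mbs_posterior))
    return s, sub_relevance(s, mbs_posterior)

# Toy MBS posterior over three predictors (illustrative numbers only).
toy = {frozenset({"A", "B"}): 0.4,
       frozenset({"A"}): 0.35,
       frozenset({"B", "C"}): 0.25}
best1 = greedy_k_submbs(1, ["A", "B", "C"], toy)   # {'A'}, posterior ~0.75
best2 = greedy_k_submbs(2, ["A", "B", "C"], toy)   # {'A','B'}, posterior ~0.4
```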

In the MCMC simulation, we also calculate various quantitative measures of convergence and confidence of the posteriors of complex features. The following set of measures can be regarded as a standard set:

1. the Geweke Z-score, measuring convergence within one chain, i.e., the significance of the difference of the posteriors between the beginning and the end of the sampling [24]

2. the Gelman–Rubin R-score, measuring inter-chain convergence, i.e., the significance of the difference of independent sampling processes [24]

3. confidence intervals, based on the standard error of the MCMC18
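Simplified versions of the first two diagnostics can be sketched as follows. These naive forms use plain sample variances, whereas the standard Geweke diagnostic uses spectral estimates of the asymptotic variance that account for autocorrelation; the window fractions are conventional defaults, not values from the chapter.

```python
import statistics

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke diagnostic: z-score for the difference between the means of
    the early and late portions of a single chain (no autocorrelation
    correction in this naive sketch)."""
    n = len(chain)
    head = chain[: int(first * n)]
    tail = chain[int((1 - last) * n):]
    se2 = (statistics.variance(head) / len(head)
           + statistics.variance(tail) / len(tail))
    return (statistics.fmean(head) - statistics.fmean(tail)) / se2 ** 0.5

def gelman_rubin_r(chains):
    """Gelman-Rubin potential scale reduction factor across parallel chains;
    values close to 1 indicate inter-chain convergence."""
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    within = statistics.fmean(statistics.variance(c) for c in chains)
    var_hat = (n - 1) / n * within + between / n
    return (var_hat / within) ** 0.5

# A well-mixed binary chain: early and late means coincide, so z is ~0,
# and two such chains give an R-score near 1.
chain = [0, 1] * 50
z = geweke_z(chain)
r = gelman_rubin_r([[0, 1] * 50, [1, 0] * 50])
```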

(p.332) Fig. 13.4 demonstrates the convergence of the applied MCMC sampling with respect to the burn-in phase. Note that these measures are different for each feature, i.e., the estimation of the MBM posteriors is typically faster than the estimation of the 3-subMBS posteriors or the MBG posterior. The values of the measures are calculated in each step l of the MCMC simulation using the MCMC samples of steps 1, …, l.

Fig. 13.4 The trace plots of the convergence and confidence indicators for the MBM posterior of rs7928208 and Asthma using data set A. A) Estimated values of the MBM posterior ($P(MBM|DN$)) and its 95% confidence interval (CI). B) Geweke’s score (Z) and Gelman–Rubin score (R) for each 10000th step (x-axis) of the MCMC simulation.

In the following, we demonstrate the application of this approach in a case study. The study involves 1201 unrelated individuals from the Hungarian (Caucasian) population. Four hundred and thirty-six asthmatic children, aged 3 to 18, were recruited for the study. The control group consisted of 765 subjects (mean age: 19 years, 405 males/360 females). We used three embedded data sets: (1) the asthma status was known in all 1201 cases (data set A); (2) in 1100 cases, the rhinitis status was also known (data set RA; only those subjects whose rhinitis status had been verified by specialists were included); and (3) in 200 cases, the rhinitis status and the serum levels of IgE and eosinophils were also known (data set CLI).

## 13.3.2 Sufficiency of the Data for Full Multivariate Analysis

Regardless of whether a Bayesian conditional method (such as Bayesian logistic regression) or the BN-based approach is used, the posterior probability distribution over sets of predictors indicating their joint relevance with respect to the selected model class (such as the MBS posterior in the case of BNs) is typically flat in the settings of contemporary GASs, because of the relation between sample size, number of predictors, effect sizes, and model complexity, including the prior. Such flat MBS posteriors, ranked from the maximum a posteriori (MAP) MBS to the less probable ones, are illustrated in Fig. 13.5, which shows that there are several sets with only slightly lower probabilities than that of the MAP set. This also indicates that the MAP MBS is not dominant, as its posterior is negligible. Furthermore, the cumulative distribution functions in Fig. 13.5 also indicate the lack of dominant MBSs, i.e., the lack of a small number of MBSs with high posteriors such that their summed posterior is close to one. It needs to be emphasized that these results are consequences of the power of the data and not of the priors; thus, they also indicate the lack of a dominantly optimal model in the frequentist, maximum likelihood approach. In the case of the data set RA and the Asthma target variable, the MAP set has a probability of only 0.010688. (p.333) Because of the smaller sample sizes, when Asthma and Rhinitis are the target variables (i.e., multitarget analysis), the probability of the MAP set is even lower, namely 0.007626. This phenomenon is even more pronounced in the case of data set CLI, where the corresponding probabilities of the MAP sets are 0.001496 (Asthma target) and 0.000073 (multitarget), respectively.
These flat MBS posteriors are consistent with our earlier findings from the simulations in [4] suggesting that a 200-sized sample in general induces a very flat posterior distribution (“small sample size”), whereas a 1000-sized sample corresponds to a “medium sample size” with respect to our settings with 100 variables, which is typical in candidate GASs and partial genome screening studies.

Fig. 13.5 Posterior probability and cumulative posterior probability plotted against the rank of the MBSs and the number of MBSs, respectively, for three cases. The three cases are the following: (A,A): the target is Asthma and the data set is A; (RA,RA): the target set contains Asthma and Rhinitis and the data set is RA; (CLI,CLI): the target set contains Eosinophil, IgE, Rhinitis, and Asthma, the data set is CLI. Ranks go from the maximum a posteriori (MAP) MBS to the less probable ones.

The relation between the posterior value p of the MBSs and the corresponding rank r is well described by a power law (cf. Zipf’s law [60]), meaning that the posterior p is proportional to $1/r^\alpha$, where $\alpha > 0$ (Fig. 13.5). The heavy-tailed distribution and the lack of a dominant set, or of a small set of highly significant sets, indicate the necessity of refined approaches to cope with weakly significant results; e.g., Fig. 13.5 shows that the cumulative posterior of the 10 000 most probable MBSs is below 0.22.
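The Zipf-like fit mentioned above amounts to a least-squares regression on the log-log rank-posterior curve. The following sketch recovers the exponent from a synthetic rank curve; real rank curves are only approximately linear in log-log scale, so the fitted alpha is then only indicative.

```python
import math

def fit_power_law_alpha(ranked_posteriors):
    """Least-squares fit of log p = log C - alpha * log r on the
    rank-posterior curve; returns alpha for p(r) ~ 1 / r**alpha."""
    xs = [math.log(r) for r in range(1, len(ranked_posteriors) + 1)]
    ys = [math.log(p) for p in ranked_posteriors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# A synthetic rank curve that is an exact power law with alpha = 1.5.
alpha = fit_power_law_alpha([0.1 * r ** -1.5 for r in range(1, 51)])
```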

## 13.3.3 Rate of Learning: Effect of Feature and Model Complexity

The MBM, k-subMBS/k-supMBS, MBS, and MBG features form a hierarchy of increasing cardinality ($|MBM|<|k-MBS|<|MBS|<|MBG|<|BN|$), which is related to the learnability of these levels (for the sample complexity of learning BNs, see [23]). We investigated the effect of this increasing cardinality and computed various performance measures to characterize the learning rate of the simplest and other more complex structural features about strong relevance, namely MBM, MBS, and MBG features. To do this, we performed systematic studies varying the sample size and model complexity using an artificial reference model M0 and an artificial data set generated from this model.

The artificial reference model M0, used in our experiments throughout this chapter, is the maximum a posteriori Bayesian network learned from data set A. In the reference model M0, the Markov blanket of the Asthma target variable consisted of 15 variables. Subsequently, we (p.334) randomly generated an artificial data set from M0, which contained 10 000 independent, identically distributed random samples from the estimated distribution. From this data set, we created smaller ones by simply taking the first 50, 100, 250, 500, 1000, 2000, and 5000 samples, respectively. Additionally, we defined embedded sets of variables with sizes 20, 40, 60, 80, and 100 ($s_{20}, s_{40}, s_{60}, s_{80}, s_{100}$) to emulate different model complexities.19 These sets always contained the 15 reference input variables and the target variable, which were supplemented with randomly drawn variables not in the reference MBS. Fig. 13.6 shows the effect strengths (odds-ratios20) of the predictors in the reference model M0. Note that there are many weak associations in the reference model M0, i.e., in the real-world data set.

Fig. 13.6 Histogram of the odds-ratios for the strongly relevant and the not-strongly relevant predictors in the reference model M0. SR, strongly relevant (black); not-SR, not-strongly relevant (gray).

The relative flatness of a posterior at a given level generally indicates uninformativeness, i.e., a high level of uncertainty, and thus the inapplicability of that level, such as the level of MBGs, MBSs, or k-subMBSs/k-supMBSs above a given k. The general uncertainty of a posterior distribution can be characterized by its entropy,21 which is high for a flat, nearly uniform, non-informative posterior distribution. Fig. 13.7 shows the entropy of the distribution of the MBM and MBS features for various model sizes and sample sizes.

Fig. 13.7 Entropy of the distribution of the MBM and MBS features as a function of sample size, for various model sizes. A) MBM. B) MBS. The various model sizes correspond to the different curves.
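The entropy used here is the ordinary Shannon entropy of the discrete posterior; a minimal sketch with invented toy posteriors illustrating the flat-versus-peaked contrast:

```python
import math

def posterior_entropy(posterior):
    """Shannon entropy (in bits) of a discrete posterior {value: prob}.
    Flat posteriors give high entropy; peaked ones give low entropy."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)

flat = {f"mbs{i}": 0.25 for i in range(4)}   # uninformative: 2 bits
peaked = {"mbs0": 0.97, "mbs1": 0.03}        # informative: ~0.19 bits
```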

(p.335) We now present results about learning rates using sensitivity (Fig. 13.8A), false discovery rate (FDR, Fig. 13.8B), misclassification rate (Fig. 13.9B), and area under the Receiver Operating Characteristic (ROC) curve (AUC; Fig. 13.9A).22 Results are calculated at an acceptance threshold of 0.5, i.e., the probability above which we accept a factor $X_i$ as relevant.

Fig. 13.8 Sensitivity and FDR as a function of the sample size, for models with different sizes. A) Sensitivity. B) FDR. The various model sizes correspond to the different curves.

Fig. 13.9 Area under the ROC curve score and misclassification rate as a function of sample size, for various model sizes. A) AUC. B) Misclassification rate. The various model sizes correspond to the different curves.

The trends indicated in Figs. 13.8 and 13.9 (page 336) are summarized in Table 13.3 (page 336). For each performance measure one can select thresholds for acceptable and very good performances, which allow the definition of “small” and “large” sample sizes as the minimal number of samples to reach these thresholds. These thresholds can be selected in our case with a given number of variables and model complexity as follows: 0.6 and 0.9 for AUC, 0.1 and 0.5 for sensitivity, and 0.5 and 0.1 for false discovery rate. The inverse problem of selecting an optimal decision threshold for a given sample size is discussed in Section 13.4. The relatively modest performance and relatively high “small” and “large” sample sizes are the consequences of the abundance of weak associations in the reference model M0.

Table 13.3 The “small” and “large” sample sizes for various performance measures (the decision threshold for MBM posteriors is 0.5).

| | Sensitivity threshold | N | FDR threshold | N | AUC threshold | N |
|---|---|---|---|---|---|---|
| “small” sample size | 0.1 | 100 | 0.5 | 250 | 0.6 | 250 |
| “large” sample size | 0.5 | 2000 | 0.1 | 5000 | 0.9 | 5000 |

Note: AUC, area under the receiver operating characteristic curve; FDR, false discovery rate.

(p.336) These results are representative of our experience from other candidate GASs, which suggests that if the ratio of the number of samples to the number of variables is below ten, then the MBS posterior is very flat but the MBM posteriors show peaks. These observations indicate the necessity of the intermediate level of k-subMBSs and k-supMBSs.

## 13.3.4 Bayesian network-based Bayesian Multilevel Analysis of Relevance

In Subsection 13.2.1 (page 320), we defined different relevance types that can be used to infer strongly relevant variables, either separately from each other (MBM) or jointly in a complete set (i.e., in an MBS). Moreover, in the case of the MBG-based relevance type, even the interactions among those strongly relevant variables can be investigated. In this section, we provide some characteristics of these relevance types and show how each of them can be used in a genetic association study to reason about the relevance of the predictors.

Two complementary approaches can be distinguished in relevance analysis depending on their focus. The first approach provides an overall characterization of relevance relations using a limited number of features and feature values, such as pairwise edges and Markov blanket membership relations [20, 21]. The second approach is at the other extreme of feature learning, as it provides a detailed characterization of the dependence relations between the target variables and the (p.337) predictors. Here we find the identification of MBSs and MBGs, which is also related to the learning of arbitrary subgraphs with statistical significance [43]. To understand this whole spectrum, first recall that the MBM, k-MBS, MBS, and MBG features are of increasing complexity. Consequently, the MBG and MBS posterior distributions are often too “flat” (i.e., there are hundreds of MBS or MBG features whose summed probability is relatively high but whose individual posteriors are low) even when the MBM posterior distributions $P(MBM(Y,X_i,G) \mid D_N)$ show peaks. To illustrate the relation between the two approaches, we report the approximation of the (set-based) MBS posterior $P(MBS(Y,G) \mid D_N)$ using the (pairwise) MBM posteriors as follows:

(13.8)
$$P(MBS(Y,G)=s \mid D_N) \approx \prod_{X_i \in s} P(MBM(Y,X_i,G)=1 \mid D_N) \prod_{X_i \notin s} \Bigl(1 - P(MBM(Y,X_i,G)=1 \mid D_N)\Bigr)$$

Fig. 13.10 shows that the MBM-based approximation allows only a rough quantitative estimation, and the corresponding ranking differs significantly (the difference is particularly pronounced in most practical cases when the sample is relatively small).

Fig. 13.10 The ranked MBS posteriors and their MBM-based approximations. Ranked MBS posteriors: black line; MBM-based approximations: gray line.
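Equation (13.8) treats the MBM events as if they were independent; a direct sketch of this product approximation follows (the MBM posteriors below are invented for illustration):

```python
def mbm_product_approximation(s, mbm_posteriors):
    """Equation (13.8): approximate P(MBS = s | D_N) assuming independent
    membership events: members contribute their MBM posterior, non-members
    the complement."""
    prob = 1.0
    for x, q in mbm_posteriors.items():
        prob *= q if x in s else 1.0 - q
    return prob

# Illustrative MBM posteriors (invented numbers).
mbm = {"A": 0.9, "B": 0.5, "C": 0.1}
approx = mbm_product_approximation({"A", "B"}, mbm)   # 0.9 * 0.5 * 0.9
```

As the text notes, this independence assumption is what makes the approximation rough: dependences between memberships are exactly what the full MBS posterior captures.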

The Bayesian multilevel analysis of relevance introduced scalable intermediate levels to provide a comprehensive view over multiple levels. It was motivated by the observation that typically—even when the MBG and MBS posterior distributions are flat—the most probable MBSs and MBGs share significant common patterns. We introduced the concept of sub-relevance, denoted in this chapter as k-subMBS (see Definition 13.6 (page 326) and [4]), to characterize the common elements. Typically, these common variables are present in MBSs with high posteriors, and they usually have larger effect sizes. The posterior probability of the sub-relevance of a subset s is:

(13.9)
$$\underline{p}(s \mid D_N) = P(MBS(Y,G)=s \mid D_N) + \sum_{s \subset s'} P(MBS(Y,G)=s' \mid D_N)$$

where the first term is the exact MBS posterior of s, and the terms behind the summation sign are the MBS posteriors of the proper supersets of s. The posterior $\underline{p}(s \mid D_N)$ expresses the probability that s contains only strongly relevant variables. If s is the common subset of almost all (p.338) highly probable MBSs, then its sub-relevance posterior will be high. On the contrary, if s is not a subset of the most probable MBSs, then this posterior will be low.

We demonstrate the use of the k-subMBS concept in the Asthma domain. In this domain, the MBS posterior distributions were very flat, whereas the MBM posterior distributions were very rough, which indicates that analysis at the intermediate levels of k-subMBSs can uncover significant results. Thus, we evaluated the partial multivariate results illustrated in Fig. 13.11. In the cases k = 1,2,3,4, the high maximum a posteriori probability (which corresponds to a relatively rough posterior distribution) indicates that the sample size is sufficient to infer that these variables are jointly strongly relevant. In contrast, for k > 4, the maximum a posteriori multivariate features are only weakly significant. These results are in line with the expectation that, with increasing feature cardinality, the posterior distribution progressively flattens out. The posteriors corresponding to the polynomially increasing cardinality of the k-subMBSs bridge the gap between the flat posterior of the MBSs and the MBM posteriors characterized by the presence of numerous peaks.

Fig. 13.11 Posteriors for MBSs and k-subMBSs for k = 1, …,5 in the case of data set CLI and target Asthma.

Analogously, we define the concept of k-supMBS to characterize the not-strongly-relevant variables, which are typically not present in MBSs. The posterior probability of sup-relevance is computed analogously to that of sub-relevance (equation (13.9)):

(13.10)
$$\bar{p}(s \mid D_N) = P(MBS(Y,G)=s \mid D_N) + \sum_{s' \subset s} P(MBS(Y,G)=s' \mid D_N)$$

The posterior $\bar{p}(s \mid D_N)$ expresses the joint probability that all strongly relevant variables are in the set s (equation (13.11)), or equivalently, that none of the $X_i \notin s$ is strongly relevant (equation (13.12)):

(13.11)
$$\bar{p}(s \mid D_N) = P\bigl(\forall X_i : MBM(Y,X_i,G)=1 \Rightarrow X_i \in s \mid D_N\bigr)$$
(13.12)
$$\bar{p}(s \mid D_N) = P\bigl(\forall X_i \notin s : MBM(Y,X_i,G)=0 \mid D_N\bigr)$$

In summary, a k-subMBS with high posterior contains variables that are strongly relevant with high confidence (“necessary” variables), and the complement of a highly probable k-supMBS (p.339) contains variables that are not strongly relevant with high confidence (i.e., a k-supMBS contains a “sufficient” set of variables). The strong relevance of any individual $X_i \notin s$ can be excluded with this probability (although $1 - P(MBM(Y,X_i,G))$ gives a sharper value as the exact posterior probability of being not strongly relevant).
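A sketch of the sup-relevance computation of equation (13.10); the toy MBS posterior is invented, and the final comparison mirrors the monotone growth of the sup-relevance along a variable ordering discussed in the text.

```python
def sup_relevance(s, mbs_posterior):
    """Equation (13.10): total posterior mass of the MBSs contained in s,
    i.e. the probability that s already covers every strongly relevant
    variable."""
    return sum(p for mbs, p in mbs_posterior.items() if mbs <= s)

# Toy MBS posterior (illustrative numbers only).
toy = {frozenset({"A"}): 0.5,
       frozenset({"A", "B"}): 0.3,
       frozenset({"B", "C"}): 0.2}
p_a = sup_relevance(frozenset({"A"}), toy)               # ~0.5
p_ab = sup_relevance(frozenset({"A", "B"}), toy)         # ~0.8
p_abc = sup_relevance(frozenset({"A", "B", "C"}), toy)   # 1.0
```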

Note that for a given precedence (ordering) of the variables ≺, as we progressively extend the empty set to the complete set following that precedence, the sup-relevance $\bar{p}$ for k = 0, 1, … increases monotonically from 0 to 1, and the sub-relevance $\underline{p}$ decreases monotonically from 1 to 0 (see Fig. 13.12).

Fig. 13.12 Illustration of sub- and sup-relevance posteriors using data set A and target Asthma. The sub-relevance $\underline{p}$ and the sup-relevance $\bar{p}$ of the k-sized sets containing the k variables with the k largest MBM posteriors, for k = 1, …, 100, are indicated by black and gray solid lines, respectively. The maximal sub-relevance $\underline{p}$ and the maximal sup-relevance $\bar{p}$ over the k-sized sets are also indicated (“max $P$(k-subMBS)”, “max $P$(k-supMBS)”). For these sets, the MBM-based approximations of their posteriors according to equation (13.8) are also shown, denoted by “$P$(k-subMBS) by MBM approximation” and “$P$(k-supMBS) by MBM approximation”, respectively.

## 13.3.5 Posteriors for Multiple Target Variables

The definition of strong relevance to multiple targets is a straightforward extension of strong relevance to a single target variable (see Subsection 13.2.5, Definition 13.7). However, the posterior for a given target set Y cannot in general be calculated from the posteriors corresponding to the members of any partitioning $Y = \bigcup_i Y_i$, although the posteriors corresponding to subsets of the target set can be used for an approximation. In the case of MBMs, the corresponding approximation is:

(13.13)
$$P(MBM(Y,X_j,G)=1 \mid D_N) \approx 1 - \prod_{i} \bigl(1 - P(MBM(Y_i,X_j,G)=1 \mid D_N)\bigr)$$

We computed and compared the posteriors of each directly relevant variable $X_j$ with respect to each target variable $Y_i$ in order to decompose the relevance of genetic factors to the various phenotypic target variables. These targets were IgE (level), Eosinophil (level), Rhinitis, and Asthma, which are believed to participate in a complex causal model with multiple paths (see Fig. 13.3, page 327). Since in this problem domain the target variables form a causal model with strong dependences, we can expect that the relevance of a factor $X_j$ to a given target is not independent of its relevance to (p.340) other targets. Indeed, the multitarget approach is motivated by the finding that equation (13.13) provides a rather poor approximation (Fig. 13.13). Note that the value of the posterior for multiple targets is higher than that for single targets, due to the possibility of relevance to any of the targets, so a correction is necessary for quantitative comparison.
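A sketch of the approximation in equation (13.13), fed with the single-target MBM posteriors of rs17831682 reported in the case study; it reproduces the 0.95 value quoted in the text, while the exact multitarget posterior there is only 0.71.

```python
def relevant_to_any_approximation(single_target_mbms):
    """Equation (13.13): approximate the posterior that a predictor is
    strongly relevant to at least one target from its single-target MBM
    posteriors, as if those events were independent (which, for strongly
    dependent targets such as Asthma and Rhinitis, they are not; hence
    the poor approximation reported in the text)."""
    p_none = 1.0
    for q in single_target_mbms:
        p_none *= 1.0 - q
    return 1.0 - p_none

# Single-target MBM posteriors of rs17831682 for IgE, Eosinophil,
# Rhinitis, and Asthma, taken from the case study in the text.
approx = relevant_to_any_approximation([0.58, 0.52, 0.53, 0.53])  # ~0.95
```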

Using the rs17831682 SNP in the PTGDR gene as an example (see PTGDR(1) in Fig. 13.3), we demonstrate the main advantages of the multitarget approach, namely that it allows us to distinguish between subtypes of multitarget relevance, which were summarized previously in Table 13.2 (page 328). When multitarget relevance is ignored, the posterior of strong relevance (i.e., the MBM posterior) of rs17831682 to IgE (level), Eosinophil (level), Rhinitis, and Asthma is 0.58, 0.52, 0.53, and 0.53, respectively, which indicates a somewhat moderate relevance to each target. The posterior for being strongly relevant to at least one of them (MBMToAny relation in Table 13.2) is higher: 0.71 (the approximation according to equation (13.13) is 0.95). However, the posterior probability that rs17831682 is strongly relevant exclusively to IgE, Eosinophil, Rhinitis, or Asthma (MBMToExactlyOne relation) is only 0.06, 0.04, 0.05, and 0.05, respectively, which shows that this SNP is probably related to multiple targets. This hypothesis is also supported by the posterior that this SNP is strongly relevant to other targets but not to IgE, Eosinophil (level), Rhinitis, or Asthma (MBMWithOthers relation): 0.37, 0.42, 0.42, and 0.42, respectively. Finally, the posterior for rs17831682 being a relevant SNP for multiple phenotypic targets (MultipleMBMs relation) is high (0.51), suggesting that this SNP is strongly relevant to the set of targets and that the SNP plays roles in multiple mechanisms.

Fig. 13.13 Posteriors of the MBM relevance for multiple target variables jointly (multitarget), separately (IgE, Eosinophil, Rhinitis, and Asthma) and with respect to the single-target approximation (approximation).

## 13.3.6 Subtypes of Strong and Weak Relevance

The distinction between different types of relevance is essential in revealing the possible causal and mechanistic path(s) that connect the relevant SNP to its target variable(s). Estimating the posterior probability of various relevance types enables us to decide whether a SNP is directly relevant or its association is mediated by other factors or both. We demonstrate the dissection of relevance types in a BN-BMLA analysis conducted on data set RA, which contains two phenotypic variables: Asthma and Rhinitis. Using Asthma as the sole target, the following posteriors were (p.341) estimated for each SNP: direct causal relevance (DCR), association (A), strong relevance (SR), interactionist relevance (IR), and indirect causal relevance (ICR). The corresponding posteriors for a number of SNPs are shown in Table 13.4.

Fig. 13.14 Comparison of posteriors for the various relevance types of the SNPs of Table 13.4. The posteriors segregate the SNPs into four groups. A) Direct relevance: PRPF19(1). B) Interactionist relevance: AHNAK(2), TXNDC16(1). C) Transitive relevance: PTGDR(2), PTGER2(2). D) Confounded relevance: WDHD1(1). A, association; DCR, direct causal relevance; ICR, indirect causal relevance; IR, interactionist relevance; SR, strong relevance.

Table 13.4 Posteriors for various types of relevance based on data set RA using Asthma as the only target. Each line contains the posteriors for a SNP identified by the first column, e.g., PTGDR(2) refers to the second SNP measured within the PTGDR gene. Subsequent columns correspond to relevance types in the following order: association (A), direct causal relevance (DCR), strong relevance (SR), interactionist relevance (IR), indirect causal relevance (ICR).

| SNP | A | DCR | SR | IR | ICR |
|---|---|---|---|---|---|
| AHNAK(2) | 0.643 | 0.029 | 0.736 | 0.708 | 0.535 |
| TXNDC16(1) | 0.309 | 0.008 | 0.722 | 0.713 | 0.189 |
| PRPF19(1) | 0.878 | 0.718 | 0.843 | 0.125 | 0.822 |
| PTGDR(2) | 0.923 | 0.326 | 0.362 | 0.035 | 0.747 |
| PTGER2(2) | 0.970 | 0.350 | 0.354 | 0.004 | 0.604 |
| WDHD1(1) | 0.964 | 0.115 | 0.242 | 0.127 | 0.202 |

In our current example, the SNPs can be clustered into four groups, as shown in Fig. 13.14. Note that association, direct relevance, transitive relevance, and interactionist relevance are complex, potentially overlapping events (see Fig. 13.2, page 325). The SNPs AHNAK(2) and TXNDC16(1) both have a moderately high posterior for strong relevance (0.736 and 0.722) but a very low direct causal relevance posterior (0.029 and 0.008). This means that the strong relevance of these SNPs to Asthma is not due to a direct causal relationship but to a purely interactionist relevance with Rhinitis. Furthermore, the fact that the posterior for a transitive relationship with Asthma is relatively low (0.535 and 0.189 for AHNAK(2) and TXNDC16(1), respectively) indicates that interactionist relevance (posterior: 0.708 and 0.713) is the only relevance subtype underlying the association of these SNPs with Asthma. This means that these SNPs are relevant and associated only if the Rhinitis state is known.

In contrast, PRPF19(1) is associated with Asthma not only transitively (0.822) but also through a direct causal relationship (0.718), suggesting two distinct causal paths connecting PRPF19(1) to Asthma: one of the paths can be blocked by other factors but the other one cannot. Its posterior for interactionist relevance is low (0.125), suggesting the lack of a third causal path.

In the third group of SNPs, PTGDR(2) and PTGER2(2) have a very high probability of association with Asthma (0.923 and 0.970, respectively), which is induced by a transitive relation, indicated by the moderately high ICR posteriors (0.747 and 0.604). Note that all other posteriors are relatively low, indicating that indirect causal relevance is the only significant relevance type in this case.

WDHD1(1) contrasts with all the other SNPs in the previous groups because it has a high probability of association with Asthma (0.964) but none of its other posteriors is significant. This is possible in the case of a pure confounding relationship, in which a common cause influences both the SNP and the target (which are otherwise independent of each other). Note that if transitive (p.342) dependence and confounded dependence cannot be differentiated, such as in the case of linked SNPs, the transitively relevant and the confounded groups can be merged.

For a more detailed biomedical discussion of the application of this methodology in asthma and allergy, we refer the reader to [58].

## 13.3.7 Interaction-redundancy Scores Based on Posteriors of Strong Relevance

The MBM posterior probabilities $P(MBM(Y,X_i,G))$ characterize the strong relevance of individual variables $X_i \in V$ (equation (13.2), page 329). However, $P(MBM(Y,X_i,G))$ inherently reflects multivariate aspects, as it is derived from a full Bayesian multivariate analysis through marginalization of the MBG posterior distribution (equation (13.6), page 329). This means that even a variable $X_i$ that is pairwise independent of the variable $Y$ can have a high MBM posterior. The reader is referred to Subsection 13.2.3 (page 323) for a discussion of Definition 13.5 (page 325) regarding the relationships connecting interactionist relevance, contextual relevance, and the relevance of a variable with no or negligible main effect.

(p.343) However, the exploration of statistical interactions is an additional challenge, because the conceptualizations of statistical interactions are based on the joint, non-linear effect of the interacting variables (for overviews, see e.g., [14, 45]; for methods, see e.g., [35, 40, 49, 65]). In the BN-based Bayesian framework, the joint relevance of the set of variables is quantified by its MBS posterior. Thus, we can define an interaction-redundancy score based on the decomposability of the posterior of the strong relevance of a set of variables [4].

Fig. 13.15 Interaction-redundancy scores based on the dependence of the posteriors of strong relevance for the Asthma target, using data set A. The thickness of the curved edge connecting two SNPs $X_i, X_j$ shows the strength (absolute value) of the interaction-redundancy score $|IRS(X_i,X_j;Asthma)|$; dashed and solid lines indicate interaction (positivity) and redundancy (negativity) of $IRS(X_i,X_j;Asthma)$, respectively. The bars on the internal circle show the MBM posteriors. The segments of the two external rings show the genes and chromosomes of the SNPs, respectively. A) All SNPs are shown. B) Only a selected subset of SNPs with high MBM posterior is shown.

Definition 13.8

For a given data set $D_N$ with sample size $N$, the features $X' = \{X_{i_1},\dots,X_{i_k}\}$ are structurally interacting (redundant) if the posterior $P(\text{k-subMBS}(X',Y,G)|D_N)$ is larger (smaller) than the MBM posterior-based product $\prod_{j=1}^{k} P(MBM(X_{i_j},Y,G)|D_N)$.

The interaction-redundancy is quantified by the following score

(13.14)
$IRS(X';Y) = \log \frac{P(\text{k-subMBS}(X',Y,G)|D_N)}{\prod_{j=1}^{k} P(MBM(X_{i_j},Y,G)|D_N)}$

The interaction-redundancy score is illustrated in Fig. 13.15. This model-level approach to interaction and redundancy formalizes the intuition that relevant input variables with decomposable roles at the parametric level appear independently in the model. If the k-subMBS posterior of set s is larger than its approximation based on MBM posteriors according to equation (13.8) and equation (13.9), it may indicate that the variables in set s have a joint parameterization expressing non-linear joint effects. In contrast, in the case of k-subMBS including redundant variables, the (p.344) posterior is smaller than its approximation based on MBM posteriors, because the joint presence of the redundant variables in the model is suppressed.
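One way to realize an interaction-redundancy score with the sign convention of Fig. 13.15 (positive for interaction, negative for redundancy) is a log-ratio of the joint k-subMBS posterior to the product of the univariate MBM posteriors. A minimal sketch with illustrative names and posterior values; in practice the posteriors would come from the MCMC estimates:

```python
import math

def irs(p_sub_mbs, mbm_posteriors):
    """Interaction-redundancy score: log-ratio of the joint (k-subMBS)
    posterior to the product of the univariate MBM posteriors.
    Positive -> interaction, negative -> redundancy."""
    return math.log(p_sub_mbs / math.prod(mbm_posteriors))

# Interaction: the joint posterior exceeds the independence approximation.
print(irs(0.6, [0.7, 0.7]) > 0)   # True
# Redundancy: the joint posterior is below the product.
print(irs(0.3, [0.7, 0.7]) < 0)   # True
```

The score is zero exactly when the joint posterior decomposes into the product of the univariate posteriors, matching Definition 13.8.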

Note that the interaction-redundancy scores corresponding to a given target do not appear to be related to the genetic linkage between the SNPs. Fig. 13.15 clearly shows that there are several intragenic, intrachromosomal, and interchromosomal interactions in this domain, exemplified by rs17197 and rs708502 in the PTGER2 gene (chromosome 14), rs12587410 in the PTGER2 gene and rs376966 in the DLG7 gene (both genes on chromosome 14), and rs11827029 in AHNAK (chromosome 11) and rs17831675 in the PTGDR gene (chromosome 14).

# 13.4 Bayes Optimal Decisions about Multivariate Relevance

The relatively high number of predictor variables in GASs poses a serious challenge because of the multiple hypothesis testing problem: the number of hypotheses is linear in the number of variables in the univariate approach, and can be exponential in the multivariate approach using a complex model class. Several approaches have emerged within the frequentist framework to handle the multiple hypothesis testing problem in both the univariate and the multivariate contexts. These include correction methods and permutation test-based approaches, and involve concepts such as the false discovery rate (FDR) and the q-value [54].

Because of its direct semantics, the Bayesian multivariate approach has a built-in, automated correction for the multiple hypothesis testing problem: the posterior typically becomes flatter as the number of variables and the model complexity increase, i.e., in a more complex hypothesis space.

Furthermore, the Bayesian decision theoretic framework permits optimal decisions about model properties, such as the optimal scientific reporting of results or the optimal continuation of the study (for Bayesian study design using BNs, e.g., see [2, 62]). First, we summarize the optimal decision problem of the relevance of variables based on their univariate posteriors and utilities. Second, we show the application of the Bayesian approach to construct a Bayesian FDR. Third, we consider the use of general informative-loss functions.

## 13.4.1 Optimal Decision about Univariate Relevance

As we discussed in Subsection 13.3.4 (page 336), the univariate view of relevance using the MBM posteriors allows an overview, but it provides only a rough approximation of the MBS posteriors (see Fig. 13.10, page 337). We can construct a 2 × 2 cost matrix for the errors of a positive/negative decision about univariate strong relevance when the real status of relevance is positive/negative: $C_{1|0}$ is the cost of a false positive decision, $C_{0|1}$ is the cost of a false negative decision, and for (p.345) simplicity we assume that the other costs are $C_{0|0}=C_{1|1}=0$. The positive decision about the strong relevance of a given variable $X_i$ with respect to the target variable $Y$ is optimal if

(13.15)
$P(MBM(Y,X_i,G)|D_N) > \tau = \frac{C_{1|0}}{C_{1|0}+C_{0|1}},$

where $\tau \in [0,1]$ is the optimal decision threshold. For example, if the cost $C_{1|0}$ of performing a further measurement on a non-relevant variable (a false positive decision) is equal to the cost $C_{0|1}$ of missing a discovery (a false negative decision), then the optimal decision threshold $\tau$ is 0.5.
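The threshold rule above can be sketched as follows (the function names are illustrative; the costs and posterior value are placeholders):

```python
def decision_threshold(c_fp, c_fn):
    """Bayes-optimal threshold tau = C(1|0) / (C(1|0) + C(0|1))."""
    return c_fp / (c_fp + c_fn)

def decide_relevant(mbm_posterior, c_fp=1.0, c_fn=1.0):
    """Positive decision iff the MBM posterior exceeds tau."""
    return mbm_posterior > decision_threshold(c_fp, c_fn)

print(decision_threshold(1.0, 1.0))             # 0.5 (equal costs)
print(decide_relevant(0.736, c_fp=1.0, c_fn=1.0))  # True
```

When false positives are costlier than false negatives, the threshold rises above 0.5 and fewer variables are declared relevant.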

Fig. 13.16 (page 346) shows the ROC curves of the MBM-based decision of relevance using the reference model $M_0$ from Subsection 13.3.3. The ROC curves show the sensitivity and specificity values for various sample sizes, considering the reference variable sets $s_{20}, s_{40}, s_{60}, s_{80}$, and $s_{100}$ with varying numbers of variables.

Fig. 13.16 The ROC curves for various model sizes used in the evaluation framework from Subsection 13.3.3. The legend corresponds to the number of samples in the data set. Increasing the number of samples provides greatly increased sensitivity. A, B, C, D, and E correspond to models with 20, 40, 60, 80, and 100 variables, respectively.

## 13.4.2 Optimal Bayesian Decision to Control FDR

The measures of classification performance, such as sensitivity, FDR, and AUC are valuable tools, but they need an external reference, a “gold standard,” which is typically available in an evaluation context (for a recent comparison of measures, see e.g., [55]). The classical frequentist approach also assumes that there is an unknown reference set, the “true model” beneath our data. However, the Bayesian framework offers a natural solution for the lack of a reference model, based on Bayesian model averaging (BMA).

In the decision theoretic framework, sensitivity, specificity, FDR, negative predictive value (NPV), and other performance measures can be interpreted as special utility or cost functions based on the selected set $\hat{s}$ and the possible set $s$ (for which a posterior exists). As an example of a cost function, FDR is the proportion of members of the selected set $\hat{s}$ not present in the possible set $s$, and sensitivity (i.e., the true positive rate, TPR) is the proportion of members of the possible set $s$ present in the selected set $\hat{s}$:

(13.16)
$FDR(\hat{s}|s) = \frac{|\hat{s} \setminus s|}{|\hat{s}|}$
(13.17)
$Sensitivity(\hat{s}|s) = \frac{|\hat{s} \cap s|}{|s|}$
(13.18)
$Specificity(\hat{s}|s) = \frac{|\hat{s}^c \cap s^c|}{|s^c|}$
(13.19)
$NPV(\hat{s}|s) = \frac{|\hat{s}^c \cap s^c|}{|\hat{s}^c|}$

In the Bayesian framework, the corresponding expected values (denoted with an upper bar) are well defined and can be approximated using MCMC simulation (see Subsection 13.3.1, page 330). For example, the expected value of the FDR corresponding to the decision of selecting $\hat{s}$ is as follows:

(13.20)
$\overline{FDR}(\hat{s}) = \sum_{s} FDR(\hat{s}|s)\, P(s|D_N)$

This calculation sums over possible scenarios with pairs $(\hat{s},s)$, where the set $\hat{s}$ is fixed and the possible set $s$ varies. The probability of a scenario is defined by the posterior $P(s|D_N)$ of the (p.346) corresponding possible set. Because the loss is defined by $FDR(\hat{s}|s)$, equation (13.20) defines the expected loss for a selected $\hat{s}$, allowing a search for an optimal set $\hat{s}^{*}$ minimizing the expected FDR loss.
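Assuming the MBS posterior is available as a finite collection of (set, probability) pairs, e.g., MCMC estimates over the most probable sets, the expected FDR can be sketched as (illustrative names and values):

```python
def fdr(selected, possible):
    """FDR of a selected set versus a possible set: the fraction of
    selected variables that are not in the possible set."""
    if not selected:
        return 0.0
    return len(selected - possible) / len(selected)

def expected_fdr(selected, mbs_posterior):
    """Posterior-weighted average of the FDR over possible sets,
    given as a {frozenset: probability} mapping."""
    return sum(p * fdr(selected, s) for s, p in mbs_posterior.items())

posterior = {frozenset({"A", "B"}): 0.7, frozenset({"A"}): 0.3}
print(expected_fdr({"A", "B"}, posterior))  # 0.7*0 + 0.3*0.5 = 0.15
```

Minimizing this expectation over candidate sets $\hat{s}$ yields the optimal selection under the FDR loss.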

Fig. 13.17 shows the expected values of these measures using data set A and the Asthma target.

(p.347) The availability of the FDR value for any set allows the definition of a Bayesian discovery threshold called the d-value, a Bayesian analogue of the q-value [54] from the frequentist approach.

Definition 13.9

The d-value for a given variable $X_i$ is the minimal expected FDR value corresponding to the multivariate selection of a relevant set $s$ including $X_i$: $\text{d-value}(X_i) = \min_{s: X_i \in s} \overline{FDR}(s)$.

Fig. 13.17 Expected value of FDR and Sensitivity for the 10 000 most probable MBS sets, using data set A and Asthma target. A) Expected value of FDR. B) Expected value of Sensitivity.

Note that the expected FDR value of a set $\hat{s}$ containing only one variable $X_i$ is $1 - P(MBM(Y,X_i))$, which is minimal for the variable with the maximal MBM posterior value ($X_{MBM} = \arg\max_{X_i} P(MBM(Y,X_i)|D_N)$). However, the MBS posterior of this set $\hat{s} = \{X_{MBM}\}$ is typically very small, and sets with high MBS posteriors typically contain further variables with high MBM posteriors. The identification and construction of the set $\hat{s}$ with minimal FDR is supported by the observation that the FDR can be decomposed as follows:

(p.348)

(13.21)
$\overline{FDR}(\hat{s}) = \sum_{s} P(s|D_N)\, FDR(\hat{s}|s)$
(13.22)
$= \sum_{s} P(s|D_N)\, \frac{1}{|\hat{s}|} \sum_{i:\hat{s}_i=1} \mathbf{1}(s_i = 0)$
(13.23)
$= \frac{1}{|\hat{s}|} \sum_{i:\hat{s}_i=1} \sum_{s} P(s|D_N)\, \mathbf{1}(s_i = 0)$
(13.24)
$= \frac{1}{|\hat{s}|} \sum_{i:\hat{s}_i=1} P(S_i = 0|D_N)$
(13.25)
$= \frac{1}{|\hat{s}|} \sum_{i:\hat{s}_i=1} \left(1 - P(MBM(Y,X_i)|D_N)\right)$

where $\hat{s}_i$ and $s_i$ denote the presence of the predictor $X_i$ in the respective set (as always, the capital $S_i$ denotes the corresponding random variable).

In fact, the expected value of any linear performance measure $U(\hat{s}|s)$ can be computed efficiently, because the following proposition holds:

Proposition 13.1

Given the linear utility function $U(\hat{s}|s) = \sum_i \lambda_i U_i(\hat{s}_i|s_i)$, the expected utility of a selected set $\hat{s}$ is $\overline{U}(\hat{s}) = \sum_i \lambda_i \overline{U}_i(\hat{s}_i)$, where $\hat{s}_i$ and $s_i$ denote $X_i \in \hat{s}$ and $X_i \in s$ respectively, and $\lambda_i \in \mathbb{R}$.

Consequently, as the following theorem shows, $\text{d-value}(X_i)$ can be defined at the univariate level of the predictors using only their MBM posteriors $P(MBM(Y,X_i))$. Furthermore, the optimal set contains the variables with maximal univariate relevance.

Theorem 13.3

Because of the linearity of the FDR and Proposition 13.1, $\text{d-value}(X_i) = \frac{1}{|s|}\sum_{X_j \in s} \left(1 - P(MBM(Y,X_j))\right)$, where the optimal set $s$ contains $X_i$ and all $X_j$ with higher posteriors (i.e., $s = \{X_i\} \cup \{X_j : P(MBM(Y,X_j)) > P(MBM(Y,X_i))\}$).

Fig. 13.18 shows the expected values of the FDR for sets $s^{*}_{MBM,k}$ of increasing size, containing the $k$ variables with the highest MBM posteriors. Because of Theorem 13.3, these are the d-values. Note that these sets are typically different from the most probable MBSs (sets $s$ with high MBS posterior), because of the difference between the MBS posterior $P(MBS(Y,G)=s|D_N)$ and its MBM posterior-based approximation in equation (13.8) (see Fig. 13.10, page 337). This is because of the potential dependences between the univariate relevances $MBM(X_i)$ and the estimation errors of the finite MCMC sampling process.
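Theorem 13.3 reduces the d-value computation to sorting the MBM posteriors and averaging the complements along the ranking. A minimal sketch (the variable names and posterior values below are illustrative, not taken from the study):

```python
def d_values(mbm):
    """d-value(X_i): mean of (1 - MBM posterior) over X_i and every
    variable with a higher MBM posterior (Theorem 13.3).
    mbm maps variable name -> MBM posterior."""
    ranked = sorted(mbm.items(), key=lambda kv: kv[1], reverse=True)
    out, running = {}, 0.0
    for k, (name, post) in enumerate(ranked, start=1):
        running += 1.0 - post      # accumulate expected false discoveries
        out[name] = running / k    # minimal expected FDR for a set with X_i
    return out

mbm = {"X1": 0.97, "X2": 0.96, "X3": 0.82}
print(d_values(mbm))  # X1: 0.03, X2: 0.035, X3: ~0.083 (up to rounding)
```

As in Fig. 13.18, the d-values are non-decreasing along the ranking, so declaring all variables above a chosen d-value threshold controls the expected FDR.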

Fig. 13.18 Expected values of Sensitivity, Specificity, FDR, and NPV corresponding to the sets $s^{*}_{MBM,k}$ with minimal FDR for various set sizes. The figure also shows the MBM posteriors ($P(MBM)$) of the predictors. The FDR curve can also be interpreted as d-values for the predictors (see Theorem 13.3).

## 13.4.3 General Bayes Optimal Decision about Multivariate Relevance

The most important factor in a general utility/loss function $U(\hat{s}|s)$ designed to select the set of variables $\hat{s}$ is the penalty for the difference between the sets $\hat{s}$ and $s$, using various loss functions, e.g., the “0–1” (L0), linear (L1), or quadratic (L2) losses. The effects of these loss functions can be illustrated in the continuous domain, where the optimal values corresponding to these losses are (p.349) the mode, median, and mean, respectively. For example, a quadratic loss function with uniform weights and its corresponding expected value are defined as

(13.26)
$U_{L2}(\hat{s}|s) = |\hat{s} \cap s| + |\hat{s}^c \cap s^c| - |\hat{s}^c \cap s| - |\hat{s} \cap s^c|$
(13.27)
$\overline{U}_{L2}(\hat{s}) = \sum_{s} P(S=s|D_N)\, U_{L2}(\hat{s}|s)$

where $()^c$ denotes the complement and the capital $S$ denotes the corresponding random variable. Note that the terms in equation (13.26) are the true positive, true negative, false negative, and false positive counts, which directly correspond to the performance measures shown in Fig. 13.18 (see equations (13.16)–(13.19)). Fig. 13.19 (page 350) illustrates the non-monotonic relation between the MBS posteriors and the expected utilities corresponding to equation (13.26).
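The quadratic utility with uniform weights and its posterior expectation can be sketched as follows (set-valued representation; the names and posterior values are illustrative):

```python
def l2_utility(selected, possible, universe):
    """Quadratic (L2) utility with uniform weights: TP + TN - FN - FP,
    as described for equation (13.26)."""
    tp = len(selected & possible)              # true positives
    tn = len(universe - selected - possible)   # true negatives
    fn = len(possible - selected)              # false negatives
    fp = len(selected - possible)              # false positives
    return tp + tn - fn - fp

def expected_l2_utility(selected, mbs_posterior, universe):
    """Posterior-weighted average of the utility over possible sets."""
    return sum(p * l2_utility(selected, s, universe)
               for s, p in mbs_posterior.items())

universe = {"A", "B", "C"}
posterior = {frozenset({"A", "B"}): 0.8, frozenset({"A"}): 0.2}
print(expected_l2_utility({"A", "B"}, posterior, universe))  # ~2.6
```

Ranking candidate sets by this expectation gives the ordering used on the x-axis of Fig. 13.19.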

Fig. 13.19 MBS posteriors plotted against the ranks of the sets, where the ordering is defined by the expected L2 utilities. For the definition of the L2 utility, see equation (13.26).


Other factors in a general utility/loss function $U(\hat{s}|s)$ can incorporate and score many aspects of the selected set $\hat{s}$. Such aspects are as follows:

1. Statistical measures: To characterize the statistical aspects of the selected set, we can combine factors such as FDR and NPV from equation (13.16) and equation (13.20).

2. Univariate importance: Depending on the domain, the cost of false positive and false negative errors can be defined separately for each variable $X_i$, i.e., using a linear utility function from Proposition 13.1 (page 348).

3. Structural aspects (interactions): The utility of the set $\hat{s}$ can also be defined based on structural aspects of the reference, using either the full network ($U(\hat{s}|G)$) or an mbg subgraph ($U(\hat{s}|mbg)$), (p.350) e.g., to express that the set $\hat{s}$, as a set of variables, includes the factors of the interactions represented by the graphical models.

4. Informative consistency and completeness: The internal consistency and completeness of the set $\hat{s}$ can be quantified based on domain-specific knowledge, thus providing a factor $U(\hat{s})$ without referring to the possible set. It can, for example, express that the predictors in the set $\hat{s}$ correspond to and completely cover one or more biological entities or mechanisms.

# 13.5 Knowledge Fusion: Relevance of Genes and Annotations

In genetic association studies, a key challenge is the incorporation of a wide range of information to transform and interpret the data at higher abstraction levels, such as at the level of haplotypes, genes, and proteins through “positional” or “functional” integration; at the level of pathways and cascades through “functional” integration; and “temporal” integration of earlier results to approximate an ideal meta-analysis. Whereas the fusion of knowledge at the level of genes, proteins, and pathways is an open research question, the data analysis workflow suggests the following scenarios. Fusion can be done either in the preprocessing phase, by creating new synthetic variables to be used in the data analysis, or in the interpretational (postprocessing) phase. The haplotype-level analysis is a well-known example of the first approach, because it is regularly done offline in a preprocessing phase or integrated into the data analysis itself (see e.g., HapScope [64], HAPLOT [27], GEVALT [17], and PHASE [53]). However, we can also construct synthetic variables $f_S(X)$ at the level of genes, proteins, and pathways to represent various functional effects, analogously to the preprocessing approach with haplotypes. If we use the synthetic variables (p.351) in the Bayesian relevance analysis, then we can estimate the posterior over their joint strong relevance, as well as the posteriors for all the partial relevances, e.g., the univariate relevances $P(f_S|D_N)$.

In the postprocessing phase, a more common approach is to use synthetic variables $f_S(X)$ only to support interpretation. The mapping of the genetic variants to genes and then the use of the Gene Ontology (GO) above the gene level is a practically important case. The most commonly used approach in the frequentist framework is to combine p-values for SNPs into an overall significance level to represent a gene, and to combine p-values for the genes into an overall significance level to investigate the association of a pathway with the disease (such combination methods include Fisher’s, Šidák’s, and Simes’ combination tests [44]). Note that these combination methods are vulnerable to correlations between the entities, and they cannot cope with complex interactions.

The Bayesian framework and the Bayesian networks offer a principled solution for the multivariate propagation of uncertainties from the level of strong relevance of the predictors to the annotations of the predictors and to the taxonomies over the predictors, i.e., to induce a posterior over related terms from the MBS posterior allowing uncertain relations in the annotations and taxonomies as well.

Let us assume a taxonomic tree $T$, where each leaf node corresponds to an indicator variable representing the strong relevance of a given variable in the analysis, the internal nodes correspond to indicator variables of terms in the taxonomy, and the edges represent the taxonomical relations, directed from specific to general (e.g., from SNPs to genes) and forming a polytree. For each term $A_{j,T}$ in the set of terms $A_T$ present in the taxonomic tree $T$, the tree defines a corresponding indicator function $A_{j,T}: X \to \{0,1\}$ over the predictors $X_i \in X$ (when context allows, we omit the index $T$ and use the term and its indicator function interchangeably):

(13.28)
$A_{j,T}(X_i) = \begin{cases} 1 & \text{if there is a directed path from } X_i \text{ to } A_{j,T} \text{ in } T, \\ 0 & \text{otherwise.} \end{cases}$

Based on these relations, a multivariate “semantic strong relevance” relation $SSR_T: 2^X \to 2^{A_T}$ can also be defined, representing the semantically related annotations $s_A$ for a given set of predictors $s$, as follows

(13.29)
$SSR_T(s) = \{A_{j,T} : \exists X_i \in s,\ A_{j,T}(X_i) = 1\}$

We will generalize this logical taxonomy to a BN, but the current definition in equation (13.29) simply states that if a predictor $X_i$ in the set $s$ is connected to the term $A_j$ through a directed path, then $A_j$ belongs to the set of “semantically related annotations” associated with $s$.

In the Bayesian statistical framework, the multivariate MBS posterior over sets of predictors $P(MBS(Y,G)=s|D_N)$ induces a multivariate posterior distribution over the sets of annotations $s_A$ as follows:

(13.30)
$P(s_A|D_N) = \sum_{s:\,SSR_T(s)=s_A} P(MBS(Y,G)=s|D_N)$

where the posterior of a given annotation/term set $s_A$ is the sum of the posteriors of the sets of predictors $s$ for which $SSR_T(s)=s_A$.

(p.352) The posterior probability of an annotation Aj is as follows:

(13.31)
$P(A_j|D_N) = \sum_{s:\,A_j \in SSR_T(s)} P(MBS(Y,G)=s|D_N)$

The interpretation of equation (13.31) can be illustrated by the following example. Let us assume that $A_j$ represents the correspondence of SNPs to a given gene $g_j$. Then equation (13.31) follows the intuition that the posterior relevance of the gene $g_j$ is the posterior probability that at least one SNP in this gene is strongly relevant. The multivariate transformation defined in equation (13.31) is typically different from various ad hoc averaging and maximization-based aggregations of the MBM posteriors corresponding to the leaves of the annotation term. This is a consequence of the fact that the multivariate approach exactly takes into account the dependences between the strong relevances of the predictors corresponding to the leaves.
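This aggregation corresponds to a simple sum over the MBS posterior: a term accumulates the probability of every predictor set that contains at least one of its leaves. A sketch using the SNP-to-gene mapping mentioned in the text; the posterior values are illustrative:

```python
def term_posterior(term_of, mbs_posterior):
    """Posterior relevance of each term (e.g., a gene): the probability
    that at least one of its predictors (e.g., SNPs) is strongly relevant,
    following the aggregation of equation (13.31).
    term_of maps predictor -> term; mbs_posterior maps frozenset -> prob."""
    out = {}
    for s, p in mbs_posterior.items():
        for term in {term_of[x] for x in s if x in term_of}:
            out[term] = out.get(term, 0.0) + p
    return out

term_of = {"rs17197": "PTGER2", "rs708502": "PTGER2", "rs11827029": "AHNAK"}
posterior = {frozenset({"rs17197"}): 0.5,
             frozenset({"rs708502", "rs11827029"}): 0.3,
             frozenset(): 0.2}
print(term_posterior(term_of, posterior))  # PTGER2: 0.8, AHNAK: 0.3
```

Note that PTGER2 receives 0.8, not the sum of per-SNP MBM posteriors: each predictor set contributes its probability at most once per term, which is exactly the “at least one SNP” semantics.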

The structure of the taxonomy and prior domain knowledge can also be used to refine the semantic relevance relations. We can interpret the taxonomy $T$ as a special BN, where the local parametric models are logical OR relations. In this model, the posterior for the multivariate semantic strong relevance relation can be interpreted as the result of an inference process with hard evidence at the leaves, which correspond to the indicator variables representing the strong relevance of the analyzed variables. However, this BN representation, which mixes predictors (e.g., SNPs) and terms, allows the incorporation of more background knowledge, e.g., using Noisy-OR local parametric models, in which the effect of the true state of a given input is inhibited with a given “inhibition probability” [41]. The parameters in the Noisy-OR models can represent the in- and out-degrees in the taxonomy: e.g., if a given term is annotated by many genes, and therefore its in-degree is relatively high, then the parameters are set to smaller values to adequately model the generality of the term. Similarly, if a given gene is annotated by many terms, and therefore its out-degree is relatively high, then the parameters can be set to smaller values to model the higher frequency of the gene.
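The Noisy-OR local model described above can be sketched as follows (the inhibition probabilities are illustrative placeholders):

```python
def noisy_or(active_inputs, inhibition):
    """Noisy-OR local model: the output is true unless every active
    input is independently inhibited.
    inhibition maps input name -> inhibition probability."""
    p_all_inhibited = 1.0
    for x in active_inputs:
        p_all_inhibited *= inhibition[x]
    return 1.0 - p_all_inhibited

# A term annotated by two strongly relevant genes, each with
# inhibition probability 0.4 (e.g., a general, high in-degree term):
print(round(noisy_or(["g1", "g2"], {"g1": 0.4, "g2": 0.4}), 2))  # 0.84
```

With inhibition probabilities of 0 the model reduces to a deterministic OR, recovering the logical taxonomy of equation (13.29).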

Fig. 13.20 Relevant functional terms of the Gene Ontology biological process sub-ontology as calculated from data set A for the target variable Asthma. The posterior probability of the strong relevance of the GO terms was aggregated from the level of SNPs. The terms are visualized as a network. The nodes are the terms, and the connections indicate the hierarchy of the ontology. The sizes of the nodes are proportional to the posterior probability of the strong relevance of the term to the target variable Asthma.

In Fig. 13.20, we demonstrate the results of the aggregation from the level of SNPs to the level of Gene Ontology biological process terms. The posterior probability of the MBSs calculated from data set A was aggregated to the level of genes by considering the physical loci and the functional roles of the SNPs. Then, we aggregated these results to the level of GO terms considering the annotations of the genes. The results can be visualized as a network whose nodes are the functional terms, and the connections between the nodes correspond to the hierarchy of the ontology. The sizes of the nodes are proportional to the posterior probability that the given functional term represented by the node has a functional role in the biological phenomena under investigation.

# 13.6 Conclusion

The Bayesian approach provides a unified framework for study design, integrated exploratory data analysis, optimal decision, and knowledge fusion in genetic association studies. Probabilistic graphical models, and particularly Bayesian networks, allow the decomposition and refinement of the overloaded concept of association. Bayesian networks in the Bayesian framework allow the inference of posteriors over multivariate strong relevance, interactions, global dependence, and causal relations, optionally with various specializations for multiple targets. Furthermore, the Bayesian network-based Bayesian MultiLevel Analysis (BN-BMLA) of relevance in GAS allows (p.353) scalable intermediate levels between univariate strong relevance and full multivariate relevance to interpret the results at partial multivariate levels; additionally, at each level, relevance can be analyzed from a dual point of view of necessity (k-subMBS) and sufficiency (k-supMBS).

The application of the Bayesian decision theoretic framework for the results of BN-BMLA from the data exploration phase opens up new possibilities to incorporate domain knowledge in supporting interpretation and potentially automating the discovery of interesting relations. The Bayesian framework also allows the principled and computationally efficient management of FDR and other performance measures.

The Bayesian statistical framework also offers a normative solution for the multiple hypothesis testing problem, which is caused by the high number of predictors and particularly by the number of interactions within the frequentist framework. This statement also remains true for the much richer hypothesis space of the new relevance relations defined in the language of BNs, such as the MBMs, k-subMBSs/k-supMBSs, MBSs, and MBGs. Within the Bayesian framework, the more-or-less flat posterior is the consequence of the high number of variables and high number of models, which is an analogue of the loss of power in the frequentist framework, because of the corrections for the high number of variables and high number of models. However, there is a fundamental difference between the two approaches, which is very valuable in the biomedical application: the Bayesian approach, specifically Bayesian model averaging, provides a normative method for the derivation of posteriors for complex hypotheses, such as k-subMBSs/k-supMBSs, MBSs, MBGs or the semantic strong relevance. This is especially important in data and knowledge fusion, which is the main bottleneck in current biomedical/translational research.

(p.354) Nonetheless, the posterior estimates for BN features, such as MBMs, k-subMBSs/k-supMBSs, MBSs, and MBGs, still suffer from the multiple hypothesis testing problem, because the MCMC process itself, i.e., their estimation, is done in the frequentist framework. But this problem is related mainly to the efficiency and length of the MCMC simulation, i.e., to the sampled DAGs in $D_L^G$, and not to the data set $D_N$. In other words, the Bayesian statistical framework transforms the statistically rooted multiple hypothesis testing problem into a computational task.

Fusion is an already well-recognized central challenge in genetic association studies. With the spread of next-generation sequencing technologies targeting rare variants, the importance of fusion will further increase. Genetic factors have a hierarchical taxonomy, starting from SNPs and moving up to genes then GO terms and pathways. We can expect a similar hierarchical taxonomy to emerge over the phenotypic descriptors as well, such as the Human Phenotype Ontology [46]. Because genetic factors are typically predictors and phenotypic descriptors are typically targets in the BN-BMLA methodology, the methodology can be seen as a support to analyze relevance at multiple granularity and multiple abstraction levels.

The advantage of the direct probabilistic semantics of the Bayesian statistical approach allows a mathematically direct and biomedically interpretable way to combine the results of data analysis with logical prior knowledge (for the aggregation of BN-BMLA results at the SNP level to gene and pathway levels, see [36, 58, 59]). In addition to the propagation of the posteriors to upper levels by aggregation, it also allows the construction of Bayesian data analytic knowledge-bases to support the fusion of weakly significant results from multiple data analyses.

• Area under the receiver operating characteristic curve (AUC)

The AUC is a widely used measure of the ranking performance of a system with respect to a classification [29]. A plausible interpretation of the AUC is the following: if we select a random positive sample and a random negative sample, the AUC is the probability that the method assigns a higher rank to the positive sample. An AUC value of 0.5 corresponds to completely random ranking; 1 corresponds to perfect prediction.
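The pairwise-ranking interpretation can be computed directly. A brute-force sketch (quadratic in the number of samples; ties are counted as half-wins):

```python
def auc_by_pairs(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive sample
    is ranked above a randomly chosen negative sample."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_by_pairs([0.9, 0.8], [0.1, 0.2]))  # 1.0 (perfect separation)
print(auc_by_pairs([0.5], [0.5]))            # 0.5 (pure tie)
```

Production implementations sort the scores once instead of enumerating all pairs, but the pairwise definition is the one stated above.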

• Bayesian model averaging (BMA)

The posterior distribution over Bayesian network structures given $N$ observations, $P(G|D_N)$, allows averaging in prediction and in the exploration of model properties. In the latter case, we can calculate the posterior probability of any structural feature $F(V',G)=f$ by summing the posterior probabilities of models with property (feature) $f$ (see equation (13.7) in Section 13.3, page 328). In fact, if the structural feature $F(V',G)$ is interpreted as a discrete transformation (mapping) $t: X \to Y$, then this corresponds to the induction of a distribution over the set $Y$ from the distribution over $X$, i.e., the transformation of the posterior distribution over the model space $P(G|D_N)$ into a posterior distribution $P(F|D_N)$ over a simpler (less numerous) feature space (see equations (13.2) and (13.7) in Section 13.3, page 328).

• (p.355) Causal model

A model of a given set of random variables is said to be causal when an edge between two nodes is present if and only if the parent is a direct cause of the child (for an in-depth treatment, see [42, 51]).

• Contextual independence

Contextual independence is a specialized form of conditional independence, when the conditional independence of two variable sets A and B conditioned on a variable set Z is valid only for a certain value c (i.e., the context) of another disjoint set C.

• d-separation

d-separation is a graphical test of independence between variables in a Bayesian network structure. Two sets of variables A and B are d-separated by a third set Z if and only if every path between A and B is blocked by Z. A path is blocked if it contains a non-collider node (a node with non-converging edges on the path) that is in Z, or a collider node (a node with converging edges on the path) such that neither it nor any of its descendants is in Z. In Bayesian networks, d-separation implies conditional independence.

• Domain

In the context of Bayesian networks, the domain is the segment of the world that a Bayesian network tries to model. A Bayesian network is said to represent a given domain if it models all the necessary knowledge of the domain. Represented/modeled entities of the domain appear as nodes in the Bayesian network.

• Feature subset selection

FSS is the task of selecting an optimal subset of features/properties from a larger set according to some external optimality criterion. A good example is selecting a subset of tests to perform so as to achieve an optimal balance between cost and information gain. FSS is often viewed as a search problem in a space of feature subsets. To carry out this search, one must specify a starting point, a strategy to traverse the space of subsets, an evaluation function (induction algorithm), and a stopping criterion.

• Filter approach

In the feature subset selection problem, the filter approach selects features using a preprocessing step that relies solely on properties of the data without taking into account the induction algorithm itself. This approach may lead to feature selections which in practice will perform poorly when evaluated by the induction algorithm. An example is to use the Pearson product–moment correlation coefficient to filter the predictors.

• Independence map

A directed acyclic graph G is an independence map of a distribution if all the Markov assumptions (independence relations) implied by G are satisfied by the distribution.

• Indicator function

An indicator function defined over a set X indicates membership of an element in a subset A of X, assigning the value 1 for all elements of A and the value 0 for all elements of X not in A.

• Induction algorithm

In the feature subset selection problem, the induction algorithm is used for evaluating a subset of the features.

• Loss function

In the context of decision theory, a loss function assigns a real number to the possible outcomes, reflecting one’s preferences over these outcomes. Contrary to a utility function, a greater value of a loss function represents dispreference of the corresponding outcome.

• (p.356) L1, L2

L1 and L2 are standard distances between two vectors. The Lp distance of two vectors $x$ and $y$ is computed as $\left(\sum_i |x_i - y_i|^p\right)^{1/p}$; L1 is the Manhattan distance, and L2 is the Euclidean distance.

• Markov compatibility

A joint probability distribution $P$ and a DAG G (over the random variables in $P$) are Markov compatible if each variable is conditionally independent of the set of all its non-descendants, given the set of all its parents.

• Modular feature

A mapping f from graph structures onto $\{0, 1\}$ is modular if $f(G) = \prod_{i=1}^{n} f_i(G_i)$, where each $f_i$ is a mapping from the subsets of the complementary set $V \setminus X_i$ onto $\{0, 1\}$; i.e., f is modular if the graph structures with $f = 1$ can be defined as the free combinations of substructures with $f_i = 1$ for $i = 1, \ldots, n$. For example, the indicator of a directed edge between two nodes is modular, because the graphs with this edge can be defined by two indicators for these two nodes selecting the legal parental sets. Furthermore, the indicator of any subgraph is also modular, because the parental sets covering the subgraph can be defined independently.

• Pairwise association

Pairwise association is a form of statistical association between two random variables. In genetic association studies, the strength of association is measured between a single nucleotide polymorphism and trait status, based on samples that are informative for the trait of interest.

• Observational equivalence class

Two directed acyclic graphs (DAGs) are observationally equivalent if they imply the same set of independence relations. Observational equivalence is an equivalence relation; the set of DAGs observationally equivalent to one another is called an observational equivalence class.

• q-value

The q-value is the false discovery rate (FDR) analogue of the p-value. The q-value of an individual hypothesis test is the minimum FDR at which the test may be called significant.
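As a hedged illustration (not from the chapter), the Benjamini–Hochberg step-up adjustment computes FDR-adjusted p-values of exactly this form; Storey's q-value additionally estimates the proportion of true null hypotheses, which is conservatively fixed at 1 in this sketch:

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values: for each test, the minimum
    FDR at which that test can be called significant (Storey's q-value with
    the true-null proportion pi0 conservatively set to 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # step-up: walk from the largest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

print(bh_qvalues([0.01, 0.02, 0.03, 0.5]))
```

Here the three small p-values all receive a q-value of 0.04: rejecting all three incurs an estimated FDR of 4%, even though each individual p-value is below that.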

• Sample complexity

Sample complexity is the minimum number of samples a learning algorithm needs to choose a hypothesis whose error is smaller than a given threshold ε with probability at least 1 − δ.

• Stable distribution

A distribution is said to be stable if there exists a directed acyclic graph that exactly represents the dependences and independences entailed by the distribution.

• Strong relevance

A feature (attribute, variable) $X_i$ is strongly relevant to Y if there exist some $x_i$, $y$, and $s_i = x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ for which $P(x_i, s_i) > 0$ such that $P(y|x_i, s_i) \neq P(y|s_i)$. Strong relevance of a variable with respect to a given target variable implies that the variable is indispensable, in the sense that it is sometimes essential for the prediction of the target.

• Utility function

In the context of decision theory, a utility function assigns a real number to the possible outcomes, reflecting one’s preferences over these outcomes. In the context of reporting a given set of strongly relevant variables, a utility function declares how good the given set is.

• (p.357) Weak relevance

A feature $X_i$ is weakly relevant to Y if it is not strongly relevant, and there exists a subset of features $S_i'$ of $S_i$ for which there exist some $x_i$, $y$, and $s_i'$ for which $P(x_i, s_i') > 0$ such that $P(y|x_i, s_i') \neq P(y|s_i')$. Weak relevance of a variable with respect to a given target variable implies that the variable can sometimes improve the prediction accuracy of the target.

• Wrapper approach

In the feature subset selection problem, the wrapper approach means that the selection algorithm searches for a good subset using the induction algorithm itself (as a black box) as part of the function evaluating feature subsets. The feature subset with the highest evaluation is chosen as the final set on which to run the induction algorithm. Examples include standard likelihood-based forward selection and backward elimination methods in linear regression and logistic regression.
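A minimal sketch of a wrapper-style greedy forward selection (not from the chapter): the black-box `evaluate` function stands in for, e.g., the cross-validated score of an induction algorithm, and the toy scoring function and feature names are illustrative:

```python
def forward_selection(features, evaluate):
    """Greedy forward selection: repeatedly add the feature whose inclusion
    most improves the black-box evaluation score; stop when no addition helps."""
    selected, best_score = [], evaluate([])
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        score, best = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:
            break
        selected.append(best)
        best_score = score
    return selected

# Toy evaluator standing in for the induction algorithm's score:
# rewards x1 and x2, and charges a small cost for every included feature.
def toy_score(subset):
    return sum({"x1": 0.4, "x2": 0.3}.get(f, 0.0) for f in subset) - 0.05 * len(subset)

print(forward_selection(["x1", "x2", "x3"], toy_score))  # -> ['x1', 'x2']
```

Because the induction algorithm is consulted at every step, the wrapper is typically more accurate but far more expensive than a filter.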

References

Bibliography references:

[1] C.F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. Koutsoukos. Local causal and Markov blanket induction for causal discovery and feature selection for classification. Journal of Machine Learning Research, 11:171–284, 2010.

[2] P. Antal, G. Hajós, and P. Sárközy. Bayesian network based analysis in sequential partial genome screening studies. In MODGRAPH 2009 Probabilistic Graphical Models for Integration of Complex Data and Discovery of Causal Models in Biology, Satellite Meeting of JOBIM 2009, 2009.

[3] P. Antal, G. Hullám, A. Gézsi, and A. Millinghoffer. Learning complex Bayesian network features for classification. In Probabilistic Graphical Models, pages 9–16, 2006.

[4] P. Antal, A. Millinghoffer, G. Hullám, C. Szalai, and A. Falus. A Bayesian view of challenges in feature selection: feature aggregation, multiple targets, redundancy and interaction. Journal of Machine Learning Research – Proceedings Track, 4:74–89, 2008.

[5] D.J. Balding, M. Bishop, and C. Cannings. Handbook of Statistical Genetics. Wiley & Sons, 2007.

[6] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In E. Horvitz and F. Verner Jensen, editors, Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI 96), pages 115–123. Morgan Kaufmann Publishers, 1996.

[7] W.L. Buntine. Theory refinement of Bayesian networks. In B. D’Ambrosio and P. Smets, editors, Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence (UAI 91), pages 52–60. Morgan Kaufmann Publishers, 1991.

[8] M. Woodward. Epidemiology: Study Design and Data Analysis. Chapman and Hall, 1999.

[9] M.-H. Chen, Q.-M. Shao, and J.G. Ibrahim. Monte Carlo Methods in Bayesian Computation. Springer-Verlag, 2000.

[10] G.F. Cooper, C.F. Aliferis, R. Ambrosino, J. Aronis, B. G. Buchanan, R. Caruana, M. J. Fine, C. Glymour, G. Gordon, B. H. Hanusa, J. E. Janosky, C. Meek, T. Mitchell, T. Richardson, and P. Spirtes. An evaluation of machine-learning methods for predicting pneumonia mortality. Artificial Intelligence in Medicine, 9:107–138, 1997.

[11] G. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 2:203–224, 1997.

[12] G.F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.

[13] G.F. Cooper and C. Yoo. Causal discovery from a mixture of experimental and observational data. In K. B. Laskey and H. Prade, editors, Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI 99), pages 116–125. Morgan Kaufmann Publishers, 1999.

[14] H.J. Cordell. Detecting gene-gene interactions that underlie human diseases. Nature Reviews Genetics, 10:392–404, 2009.

[15] (p.358) H.J. Cordell and D.G. Clayton. Genetic association studies. Lancet, 366:1121–1131, 2005.

[16] R. Culverhouse, B.K. Suarez, J. Lin, and T. Reich. A perspective on epistasis: limits of models displaying no main effect. American Journal of Human Genetics, 70:461–471, 2002.

[17] O. Davidovich, G. Kimmel, and R. Shamir. Gevalt: an integrated software tool for genotype analysis. BMC Bioinformatics, 8:2105–2112, 2007.

[18] D.G.T. Denison, C.C. Holmes, B.K. Mallick, and A.F.M. Smith. Bayesian Methods for Nonlinear Classification and Regression. Wiley & Sons, 2002.

[19] B.I. Fridley. Bayesian variable and model selection methods for genetic association studies. Genetic Epidemiology, 33:27–37, 2009.

[20] N. Friedman, M. Goldszmidt, and A. Wyner. On the application of the bootstrap for computing confidence measures on features of induced Bayesian networks. In 7th International Workshop on Artificial Intelligence and Statistics, pages 197–202, 1999.

[21] N. Friedman and D. Koller. Being Bayesian about network structure. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI 2000), pages 201–211. Morgan Kaufmann Publishers, 2000.

[22] N. Friedman and D. Koller. Being Bayesian about network structure. Machine Learning, (50):95–125, 2003.

[23] N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In E. Horvitz and F. Verner Jensen, editors, Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI 96), pages 274–282. Morgan Kaufmann Publishers, 1996.

[24] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall, 1995.

[25] P. Giudici and R. Castelo. Improving Markov chain Monte Carlo model search for data mining. Machine Learning, (50):127–158, 2003.

[26] C. Glymour and G.F. Cooper, editors. Computation, Causation, and Discovery. AAAI Press, 1999.

[27] S. Gu, A.J. Pakstis, and K.K. Kidd. Haplot: a graphical comparison of haplotype blocks, tagSNP sets and SNP variation for multiple populations. Bioinformatics, 21(20):3938–3939, 2005.

[28] B. Han, M. Park, and X. Chen. A Markov blanket-based method for detecting causal SNPs in GWAS. BMC Bioinformatics, 11:5, 2010.

[29] D.J. Hand. Construction and Assessment of Classification Rules. Wiley & Sons, 1997.

[30] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, (20):197–243, 1995.

[31] X. Jiang, M.M. Barmada, and S. Visweswaran. Identifying genetic interaction in genome-wide data using Bayesian networks. Genetic Epidemiology, (34):575–581, 2010.

[32] R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial Intelligence, 97:273–324, 1997.

[33] M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, (5):549–573, 2004.

[34] D. Koller and M. Sahami. Toward optimal feature selection. In Thirteenth International Conference on Machine Learning, pages 284–292. Morgan Kaufmann Publishers, 1996.

[35] C. Kooperberg and I. Ruczinski. Identifying interacting SNPs using Monte Carlo logic regression. Genetic Epidemiology, 28:157–170, 2005.

[36] O. Lautner-Csorba et al. Candidate gene association study in pediatric acute lymphoblastic leukemia evaluated by Bayesian network based Bayesian multilevel analysis of relevance. BMC Medical Genomics, 5:42, 2012.

[37] D. Madigan, S.A. Andersson, M. Perlman, and C.T. Volinsky. Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Communications in Statistics: Theory and Methods, (25):2493–2520, 1996.

[38] C. Meek. Causal inference and causal explanation with background knowledge. In P. Besnard and S. Hanks, editors, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI 95), pages 403–410. Morgan Kaufmann Publishers, 1995.

[39] (p.359) J.H. Moore, J.C. Gilbert, C.T. Tsai, F.T. Chiang, T. Holden, N. Barney, and B.C. White. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. Journal of Theoretical Biology, 241:252–261, 2006.

[40] M.Y. Park and T. Hastie. Penalized logistic regression for detecting gene interactions. Biostatistics, (9):30–50, 2007.

[41] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[42] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[43] D. Pe’er, A. Regev, G. Elidan, and N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, (17):215–224, 2001.

[44] G. Peng, L. Luo, H. Siu, Y. Zhu, P. Hu, S. Hong, J. Zhao, X. Zhou, J.D. Reveille, L. Jin, C.I. Amos, and M. Xiong. Gene and pathway-based second-wave analysis of genome-wide association studies. European Journal of Human Genetics, 18:111–117, 2010.

[45] P.C. Phillips. Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics, 9:855–867, 2008.

[46] P.N. Robinson and S. Mundlos. The human phenotype ontology. Clinical Genetics, 77:525–534, 2010.

[47] I. Ruczinski, C. Kooperberg, and M. LeBlanc. Logic regression. Journal of Computational and Graphical Statistics, 12:475–511, 2003.

[48] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23:2507–2517, 2007.

[49] H. Schwender and K. Ickstadt. Identification of SNP interactions using logic regression. Biostatistics, (9):187–198, 2008.

[50] T. Silander, P. Kontkanen, and P. Myllymäki. On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In R. Parr and L. C. van der Gaag, editors, Proceedings of the Twenty-third Conference on Uncertainty in Artificial Intelligence (UAI-07), pages 360–367. AUAI Press, 2007.

[51] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2001.

[52] M. Stephens and D.J. Balding. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics, 10(10):681–690, 2009.

[53] M. Stephens and P. Donnelly. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. The American Journal of Human Genetics, 73(5):1162–1169, 2003.

[54] J.D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16):9440–9445, 2003.

[55] S.J. Swamidass, C. Azencott, K. Daily, and P. Baldi. A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval. Bioinformatics, 26:1348–1356, 2010.

[56] I. Tsamardinos and C. Aliferis. Towards principled feature selection: relevancy, filters, and wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Morgan Kaufmann Publishers, 2003.

[57] M. Ueno. Learning networks determined by the ratio of prior and data. In P. Grünwald and P. Spirtes, editors, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 10), pages 598–605. AUAI Press, 2010.

[58] I. Ungvári, G. Hullám, P. Antal, P.S. Kiszel, A. Gézsi, É. Hadadi, V. Virág, G. Hajós, A. Millinghoffer, A. Nagy, A. Kiss, Á.F. Semsei, G. Temesi, B. Melegh, P. Kisfali, M. Széll, A. Bikov, G. Gálffy, L. Tamási, A. Falus, and C. Szalai. Evaluation of a partial genome screening of two asthma susceptibility regions using Bayesian network based Bayesian multilevel analysis of relevance. PLOS ONE, (7):e33573, 2012.

[59] G. Varga, A. Szekely, P. Antal, P. Sárközy, Z. Nemoda, Z. Demetrovics, and M. Sasvari-Szekely. Additive effects of serotonergic and dopaminergic polymorphisms on trait impulsivity. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 159:281–288, 2012.

[60] R.E. Wyllys. Empirical and theoretical bases of Zipf’s law. Library Trends, (30):53–64, 1981.

[61] (p.360) H. Xing, P.D. McDonagh, J. Bienkowska, T. Cashorali, K. Runge, R.E. Miller, D. DeCaprio, B. Church, R. Roubenoff, I.G. Khalil, and J. Carulli. Causal modeling using network ensemble simulations of genetic and gene expression data predicts genes involved in rheumatoid arthritis. PLOS Computational Biology, 7(3):e1001105, 2011.

[62] C. Yoo and G. Cooper. An evaluation of a system that recommends microarray experiments to perform to discover gene-regulation pathways. Artificial Intelligence in Medicine, (31):169–182, 2004.

[63] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205–1224, 2004.

[64] J. Zhang, W.L. Rowe, J.P. Struewing, and K.H. Buetow. HapScope: a software system for automated and visual analysis of functionally annotated haplotypes. Nucleic Acids Research, 30(23):5213–5221, 2002.

[65] Y. Zhang and J.S. Liu. Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39:1167–1173, 2007.

## Notes:

(1) For a brief definition of the wrapper framework, see Appendix 13.A (page 354) or [32].

(2) For a brief definition of loss functions in the feature subset selection context, see Appendix 13.A (page 354).

(3) For the definition of independence map, see Appendix 13.A (page 354) or [41].

(4) For the definition of a causal model, see Appendix 13.A (page 354) or [26, 42].

(5) For the definition of stable distribution, see Appendix 13.A (page 354).

(6) $P$ is Markov compatible with respect to G if $P$ can be factorized by G [42].

(7) For a brief definition of observational equivalence classes of BNs, see Appendix 13.A (page 354).

(8) For the definition of d-separation, see Appendix 13.A (page 354) or [42].

(9) For a brief definition of sample complexity, see Appendix 13.A (page 354).

(10) Note that the present definition of confounded relevance covers only a partial aspect of confounding [8].

(11) The interactionist relevance (IR) representation of a contextually relevant factor Xi for target Y using variable C as a common child is incorrect, because it implies that the effect of variable Xi on Y related to variable C is not present if we do not know C. This typically arises if the Xi and C variables in the interaction are dependent and Xi has negligible main effect on target Y. However, the IR representation is correct in the special case of a purely epistatic factor, which has no main effect [16].

(12) In a similar vein, the local causal discovery methods apply well-selected local tests instead of using global scores or interventional data, e.g., see [11, 13].

(13) For a brief definition of Bayesian model averaging, see Appendix 13.A (page 354).

(14) For the definition of the indicator function, see Appendix 13.A (page 354).

(15) For a brief definition of indicator function, see Appendix 13.A (page 354).

(16) For a brief definition of modular features, see Appendix 13.A (page 354).

(17) The goal of a post-hoc analysis is to search the primary, “raw” results (in our case the MBS posteriors) for patterns that are not expected a priori or that are too numerous to sample directly. In our case, such patterns are the k-subMBS and k-supMBS posteriors, which would be redundant and intractable if directly sampled during the MCMC run.

(18) Confidence intervals for MCMC estimation can be calculated e.g. according to the batch means method [9].

(19) Model complexity also depends on the maximum number of parents and the local models used in the BN (A local model defines the conditional distribution for a variable given its parents).

(20) Odds ratios are standard measures of the effect size of a binary predictor X on a binary target Y. The odds ratio is defined as the ratio of the odds of Y when X=1 to the odds of Y when X=0: $\dfrac{P(Y=1|X=1)/P(Y=0|X=1)}{P(Y=1|X=0)/P(Y=0|X=0)}$.
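A minimal sketch of this computation from a 2×2 contingency table (the counts are illustrative and not from the chapter):

```python
def odds_ratio(counts):
    """Odds ratio from a 2x2 contingency table of (X, Y) counts,
    where counts[x][y] is the number of samples with X=x and Y=y."""
    odds_exposed = counts[1][1] / counts[1][0]    # odds of Y=1 given X=1
    odds_unexposed = counts[0][1] / counts[0][0]  # odds of Y=1 given X=0
    return odds_exposed / odds_unexposed

# Rows: X=0 / X=1; columns: Y=0 / Y=1.
table = [[40, 10],   # X=0: 40 with Y=0, 10 with Y=1
         [20, 30]]   # X=1: 20 with Y=0, 30 with Y=1
print(odds_ratio(table))  # (30/20) / (10/40) = 6.0
```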

(21) The entropy H(X) of a discrete random variable X with probability mass function P(X) is the expected value of the negative logarithm of P(X), that is: $H(X) = E[-\ln P(X)]$. An important consequence of this definition is that a uniformly distributed random variable (i.e., one whose distribution is completely “flat”) has maximal entropy. Note that the entropy depends on the cardinality, e.g., in the case of uniform distributions, entropy is higher for the MBG than for the MBS level. Also note that the MBM level consists of separate entropies for each $MBM(Y, X_i, G)$, which are summed to provide an aggregate measure (this summation corresponds to the simplifying assumption that $MBM(Y, X_i, G)$ and $MBM(Y, X_j, G)$ are mutually independent).
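A minimal sketch (not from the chapter) illustrating the definition and the fact that the uniform distribution maximizes entropy; the distributions are illustrative:

```python
import math

def entropy(pmf):
    """Entropy H(X) = E[-ln P(X)] of a probability mass function, in nats.
    Terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform))  # ln(4), the maximum for 4 outcomes
print(entropy(skewed))   # strictly lower than the uniform entropy
```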

(22) For the definition of AUC, see Appendix 13.A (page 354).

(23) For the definition of the q-value, see Appendix 13.A.

(24) For the definition of q-value, see Appendix 13.A (page 354).

(25) Consistency expresses the lack of contradiction, e.g., that the set does not contain variables from competing, alternative theories.

(26) Completeness expresses the lack of omission, e.g., that the set contains all the variables corresponding to a given theory.

(27) For a brief description of Bayesian model averaging, see Appendix 13.A.

(28) Note that the BN-BMLA methodology is neutral for the selection of target variables and that the MBM relation is symmetric. Consequently, a genetic variable or small group of genetic variables can also be the target to explore their phenotypic relevance in case of rich phenotypic descriptors.