Bayesian, Systems-based, Multilevel Analysis of Associations for Complex Phenotypes: from Interpretation to Decision
Abstract and Keywords
The relative scarcity of the results reported by genetic association studies (GASs) prompted many research directions. Despite the centrality of the concept of association in GASs, refined concepts of association are missing; meanwhile, various feature subset selection methods became de facto standards for defining multivariate relevance. On the other hand, probabilistic graphical models, including Bayesian networks (BNs), are more and more popular, as they can learn nontransitive, multivariate, nonlinear relations between complex phenotypic descriptors and heterogeneous explanatory variables. To integrate the advantages of Bayesian statistics and BNs, the Bayesian network-based Bayesian multilevel analysis of relevance (BN-BMLA) was proposed. This approach allows the processing of multiple target variables, while ensuring scalability and providing a multilevel view of the results of multivariate analysis. This chapter discusses the use of Bayesian BN-based analysis of relevance in exploratory data analysis, optimal decision and study design, and knowledge fusion, in the context of GASs.
Keywords: genome-wide association studies, Bayesian network, relevance
Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics. First Edition. Christine Sinoquet & Raphaël Mourad (Eds). © Oxford University Press 2014. Published in 2014 by Oxford University Press.
The relative scarcity of the results reported by genetic association studies (GASs) has prompted many research directions, such as the use of univariate Bayesian analysis and the use of multivariate, complex, or integrated models. Despite the centrality of the concept of association in GASs, refined concepts of association are missing; meanwhile, various feature subset selection methods became de facto standards for defining multivariate relevance. On the other hand, probabilistic graphical models, including Bayesian networks (BNs), are more and more popular, as they can learn nontransitive, multivariate, nonlinear relations between complex phenotypic descriptors and heterogeneous explanatory variables. To integrate the advantages of Bayesian statistics and BNs, the BN-based Bayesian multilevel analysis of relevance (BN-BMLA) was proposed. This approach allows the processing of multiple target variables, while ensuring scalability and providing a multilevel view of the results of multivariate analysis. This chapter discusses the use of Bayesian BN-based analysis of relevance in exploratory data analysis, optimal decision and study design, and knowledge fusion, in the context of GASs. First, various BN-based concepts of association and relevance are overviewed. In particular, the chapter analyzes the connections between BNs, strong/weak relevance, and Markov blankets/boundaries or related sets, and relevance relations are defined. Then the advantages of the Bayesian statistical approach for sufficiently characterizing and exploring weakly significant results are shown. For this purpose, the focus is set on the posteriors defined over the relevance relations mentioned above. The next section discusses Bayes optimal decisions on multivariate relevance in GAS results. In the last section, it is shown that the Bayesian BN-based approach provides a framework for the fusion of the results obtained through various genetic data analyses.
This last section describes a procedure dedicated to estimating the posteriors of complex features, such as those involved in the hierarchical, interrelated hypotheses of the BNBMLA framework.
13.1 Introduction
The relative scarcity of results reported by genetic association studies has led to several approaches, such as univariate Bayesian analysis [5, 52] and the use of multivariate, complex, or integrated models [61, 65]. Probabilistic graphical models, including Bayesian networks (BNs), are more and more popular, as they can learn nontransitive, multivariate, nonlinear relations between complex phenotypic descriptors (also known as target variables, dependent variables, or dependent outcomes) and heterogeneous, mostly genetic, explanatory variables (input variables, factors, or predictors, also known as attributes or features in statistics). We discuss the limitations of conventional statistical association here and show how they can be circumvented using the notion of the so-called relevance of input variables to one or more target variables. We will show in Subsection 13.2.1 that relevance is a useful extension of association. Recently we proposed the Bayesian network-based Bayesian multilevel analysis of relevance (BN-BMLA), which integrates the advantages of Bayesian statistics and BNs [4]. Moreover, this approach allows the investigation of multiple target variables and provides scalable intermediate levels between univariate strong relevance and full multivariate relevance, to interpret the results at partial multivariate levels. Thus, BN-BMLA provides a multilevel view on multivariate analysis. We discuss the use of BN-BMLA relevance analysis in data exploration, optimal decision, and knowledge fusion.
First, Section 13.2 overviews some of the structural properties of BNs, with special emphasis on the systems-based analysis of relevance. Section 13.3 shows the advantages of the Bayesian statistical approach in the characterization and exploration of weakly significant results. In Section 13.4, we discuss the application of Bayesian decision theory to GASs. In Section 13.5, we deal with the Bayesian interpretation and fusion of results from data analysis. Throughout the chapter, we also consider the practical and computational aspects of Bayesian inference: we apply the described methods in the domain of asthma.
Input variables in the current application typically correspond to single nucleotide polymorphisms (SNPs), whereas target variables represent asthmatic and possibly other phenotypic aspects. The notation is as follows: input variables (typically SNPs) are denoted by X ($X=\{{X}_{1},\dots ,{X}_{n}\}$), target variables (typically asthma and possibly other phenotypic variables) are denoted by Y ($Y=\{{Y}_{1},\dots ,{Y}_{m}\}$), and V ($V=X\cup Y$) denotes all the variables. The data set with N samples is denoted $D_N$.
13.2 Bayesian Network-based Concepts of Association and Relevance
Refined concepts of association are missing in genetic association studies despite the centrality of the concept. Therefore, various feature subset selection (FSS) methods became de facto standards for quantifying the joint relevance of multiple variables and their interactions, which will be referred to as multivariate relevance (for an overview of FSS the reader is referred to [48]). In this section, we discuss the use of Bayesian network structural properties to define such refined concepts.
13.2.1 Association and Strong Relevance
The extension of standard statistical pairwise association (the univariate case) toward many-to-one or many-to-many relations (the multivariate case) is a challenging task, because several goals can be formulated, such as assessing the predictive performance and the interdependence of multiple predictors.
In the predictive approach to the identification of relevant variables, relevance to the target variable Y is defined in the wrapper framework below.^{1} The inherent limitation of a wrapper approach is that it is biased by the class of predictive model used, by the optimization algorithm, by the data set, and by the loss function^{2} quantifying false and missed discovery [32]. A typical example setting in case-control studies is the use of logistic regression fitted by a gradient descent method to minimize misclassification error, optionally with complexity regularization to minimize overfitting for a given data set. The standard conditional probabilistic version of relevance types is free of model class, optimization, data set, or loss function, and is defined as follows:
Definition 13.1
A feature (input variable) $X_i$ is strongly relevant to Y if there exist some ${X}_{i}={x}_{i}$, $Y=y$, and ${s}_{i}={x}_{1},\dots ,{x}_{i-1},{x}_{i+1},\dots ,{x}_{n}$ (a value assignment to ${S}_{i}=\{{X}_{1},\dots ,{X}_{i-1},{X}_{i+1},\dots ,{X}_{n}\}$) such that $\mathbb{P}({x}_{i},{s}_{i})>0$ and $\mathbb{P}(y \mid {x}_{i},{s}_{i})\ne \mathbb{P}(y \mid {s}_{i})$. A feature $X_i$ is weakly relevant if it is not strongly relevant and there exists a subset of features ${S}_{i}^{\prime}$ of $S_i$ for which there exist some ${x}_{i}$, $y$, and ${s}_{i}^{\prime}$ such that $\mathbb{P}({x}_{i},{s}_{i}^{\prime})>0$ and $\mathbb{P}(y \mid {x}_{i},{s}_{i}^{\prime})\ne \mathbb{P}(y \mid {s}_{i}^{\prime})$. A feature is relevant if it is either weakly or strongly relevant; otherwise it is irrelevant [32].
Strong relevance formalizes the necessary and sufficient set of predictors, whereas weak relevance does not require necessity. For example, direct associations correspond to strong relevance, whereas indirect associations, mediated by other variables in linkage, correspond to weak relevance (for direct, indirect, and confounded associations, see [15]).
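Definition 13.1 can be checked by brute force on small discrete distributions. The sketch below is illustrative only: it uses a hypothetical three-variable joint distribution in which X1 causes Y and X2 is merely a noisy proxy of X1, so X1 should come out strongly relevant and X2 only weakly relevant (an indirect association in the genetic terminology above).

```python
# Brute-force check of Definition 13.1 on a toy joint P(X1, X2, Y).
# Assumed toy model: X1 ~ Bernoulli(0.5), X2 copies X1 with prob. 0.8,
# Y copies X1 with prob. 0.9. Keys of P are (x1, x2, y) value tuples.
from itertools import product

P = {(x1, x2, y): 0.5 * (0.8 if x2 == x1 else 0.2) * (0.9 if y == x1 else 0.1)
     for x1, x2, y in product([0, 1], repeat=3)}

def prob(assign):
    """Probability of a partial assignment {variable_index: value}."""
    return sum(p for k, p in P.items()
               if all(k[i] == v for i, v in assign.items()))

def p_y_given(y, ctx):
    """P(Y=y | ctx), with Y at index 2 and ctx over the input indices."""
    return prob({**ctx, 2: y}) / prob(ctx)

def strongly_relevant(i):
    # With n=2 inputs, S_i is just the other input variable.
    other = 1 - i
    return any(prob({i: xi, other: xo}) > 0 and
               abs(p_y_given(y, {i: xi, other: xo}) -
                   p_y_given(y, {other: xo})) > 1e-12
               for xi, xo, y in product([0, 1], repeat=3))

def weakly_relevant(i):
    # Not strongly relevant, yet relevant given some subset of S_i
    # (here the only proper subset is the empty set).
    return (not strongly_relevant(i) and
            any(prob({i: xi}) > 0 and
                abs(p_y_given(y, {i: xi}) - p_y_given(y, {})) > 1e-12
                for xi, y in product([0, 1], repeat=2)))
```

With this toy distribution, `strongly_relevant(0)` holds for X1 while X2 satisfies only `weakly_relevant(1)`, matching the direct/indirect distinction above.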
Besides the predictive aspects, relevance can also be defined using BNs to represent complex dependence patterns between phenotypic descriptors and multiple predictors, and to express causal relationships between variables. In general, a BN represents the joint probability distribution $\mathbb{P}(V)$ with two main components. The first main component is a directed acyclic graph (DAG) G, whose nodes are random variables from the set V. We often refer to G as the structure of the BN. A directed edge pointing from a socalled parent variable X_{p} to its child variable X_{c} implies not only that X_{c} depends on X_{p} but also that X_{p} is a direct cause of X_{c} in the sense that the BN does not contain any other variable X_{i} that mediates the effect of X_{p} on X_{c}. The strength of such dependence of X_{c} on X_{p} is given by the extent to which the probability distribution of X_{c} shifts in response to a certain change in the value of X_{p}. This gives rise to a separate set of parameters defining the conditional distribution for each variable, given each possible value of its parent variables; these parameter sets altogether form θ, the second main component of the BN.
The BN representation is unique in offering three integrated levels to formulate these aspects [42]:
1 Predictive performance: Multivariate relevance can be based on the concepts of predictivity, cardinality and redundancy of the predictors.

2 Global patterns of independences: The global characterization of relevance relations based on the overall independence map^{3} of the application domain is of fundamental importance to clarify direct, indirect and confounded associations [15].

3 Causal aspects: A causal model^{4} also helps to dissect and estimate the effect size of genetic mechanisms related to complex phenotypes described by multiple variables [42].
The independence mapbased approach provides the definition of joint relevance of multiple input variables (therefore, this approach is more naturally suited to multivariate analysis) [41]:
Definition 13.2
A set of variables ${X}^{\prime}\subseteq V$ is called a Markov blanket set (MBS) of Y with respect to the distribution $\mathbb{P}(V)$ if ${(Y\perp\!\!\!\perp V\setminus {X}^{\prime} \mid {X}^{\prime})}_{\mathbb{P}}$, where $\perp\!\!\!\perp$ denotes conditional independence. A minimal Markov blanket is called a Markov boundary.
The probabilistic definitions of Markov blanket and Markov boundary became fundamental concepts in the modern multivariate conceptualization of multivariate associations and provide foundations and motivations for many feature subset selection methods [1, 34, 56]. It follows from these basic characteristics of BNs that they offer a graphical representation of relevance derived both from the predictive approach (cf. strong and weak relevance) and the independence mapbased one (cf. Markov blankets and boundaries). Thus, BNs provide a unifying framework for both approaches. Additionally, this common framework allows the extension of the concept of relevance relations with causality. The connections between BNs, strong/weak relevance, and Markov blankets/boundaries are provided by the following theorems:
Theorem 13.1
For a distribution $\mathbb{P}(V)$ defined by the BN $(G,\mathrm{\theta})$ the variables bd(Y,G) form a (not necessarily unique or minimal) Markov blanket of Y, where bd(Y,G) denotes the set of parents, children, and the children’s other parents for Y [41].
Theorem 13.2
If $\mathbb{P}(V)$ is a stable distribution^{5} and DAG G is Markov compatible with $\mathbb{P}(V)$,^{6} then bd(Y,G) is the unique, minimal Markov blanket (called a Markov boundary). In stable distributions, bd(Y,G) also identifies the strongly relevant variables [56].
A BN structure explicitly represents both aspects of the Markov boundary: on the one hand, bd(Y,G) is necessary because it contains only strongly relevant variables; on the other hand, bd(Y,G) is sufficient because none of the remaining variables in V (i.e., the remaining nodes in G) is strongly relevant.
Definition 13.3
The induced pairwise relation $MBM(Y,{X}_{i},G)$ with respect to G between Y and X_{i} is called Markov blanket membership (MBM):
(13.1)$$MBM(Y,{X}_{i},G)\iff {X}_{i}\in bd(Y,G).$$
As Theorem 13.2 shows, the boundary set bd(Y,G) exactly represents the (probabilistically defined) Markov boundary, and the memberships $MBM(Y,{X}_{i},G)$ exactly represent the (probabilistically defined) strongly relevant variables $X_i$, for a stable distribution $\mathbb{P}(V)$ and a Markov compatible DAG G. The following Subsection 13.2.2 describes the mathematical background of the soundness of representing relevance by bd(Y,G). Readers interested only in the application of the concepts can continue at Subsection 13.2.3.
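The graphical constructions in Theorem 13.1 and Definition 13.3 are mechanical, so a minimal sketch may help; the DAG below and its variable names are illustrative only, with the graph given as a set of directed (parent, child) edges.

```python
# Sketch of bd(Y, G) (Theorem 13.1) and MBM (Definition 13.3 / eq. 13.1)
# for a DAG represented as a set of directed (parent, child) edges.
def bd(y, edges):
    """Parents, children, and the children's other parents of y."""
    parents = {a for a, b in edges if b == y}
    children = {b for a, b in edges if a == y}
    other_parents = {a for a, b in edges if b in children and a != y}
    return (parents | children | other_parents) - {y}

def mbm(y, xi, edges):
    """Markov blanket membership: X_i in bd(Y, G)."""
    return xi in bd(y, edges)

# Illustrative DAG: X4 -> X1 -> Y -> X2 <- X3.
G = {("X1", "Y"), ("Y", "X2"), ("X3", "X2"), ("X4", "X1")}
# bd("Y", G) contains X1 (parent), X2 (child), X3 (X2's other parent),
# but not X4, which only influences Y through X1.
```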
13.2.2 Stable Distributions, Markov Blankets and Markov Boundaries
The limitation of the assumption about the stability of the distribution is a subtle issue. The main reason is that the independences involved are only parametrically, not structurally, defined [41, 42]. Whether our hypotheses are the BN structures (DAGs G) or the observational equivalence classes of BNs^{7} (${G}^{\sim}$), for a given G or ${G}^{\sim}$ the conditional independences structurally implied through d-separation^{8} hold in any Markov compatible distribution $\mathbb{P}(V)$ [42]. This means that bd(Y,G) represents a Markov blanket in any Markov compatible distribution, and $MBM(Y,{X}_{i},G)$ represents relevant variables. Note that the graphical definition of MBM satisfies the probabilistic definition of Markov blanket in Definition 13.2 according to Theorem 13.1, assuming a distribution Markov compatible with G, because bd(Y,G) is always a Markov blanket. Furthermore, the definition of relevance based on bd(Y,G) avoids the problem of the multiplicity of Markov blankets and defines mutually exclusive and exhaustive events (a sample space), which will be necessary to define a probability distribution over the sets of strongly relevant variables (see Section 13.3).
However, according to Theorem 13.2, in the case of a stable distribution the implied results are sharper: bd(Y,G) represents a Markov boundary and $MBM(Y,{X}_{i},G)$ represents strongly relevant variables. Whereas in general only the Markov blanket and relevance concepts can be used, non-stable distributions are rare in a technical sense. This property appears formally in the Bayesian approach used in this chapter as follows: using a continuous parameter distribution $\mathbb{P}(\theta \mid G)$, the specified distribution $\mathbb{P}(V)$ is stable with probability 1 [38]. Thus, in that Bayesian context, bd(Y,G) and $MBM(Y,{X}_{i},G)$ almost surely represent Markov boundaries and strong relevance, and therefore we can use the stricter Markov boundary concept, despite the more widespread use of the term Markov blanket.
For simplicity, we also refer to bd(Y,G) as the Markov boundary set of Y, using the notation $MBS(Y,G)$ and assuming that the hypothesized distribution(s) is (are) stable with respect to G. Under the same assumption, we also refer to $MBM(Y,{X}_{i},G)$ as a strong relevance relation. According to Theorem 13.2, this graph-based definition of the MBM relation is equivalent to the probabilistic definition of the strong relevance relation, assuming a Markov compatible stable distribution. Fig. 13.1 shows the possible relevance status of the variables for stable and non-stable distributions Markov compatible with G.
13.2.3 Further Relevance Types
The concept of multivariate relevance can be further enriched using the common conceptual framework of BNs to represent all the relevant interactions between the strongly relevant variables as follows:
Definition 13.4
A subgraph of BN structure G is called the Markov blanket graph or mechanism boundary graph (MBG) MBG(Y,G) of variable Y if it includes the nodes in the Markov blanket defined by bd(Y,G) and the incoming edges into Y and its children [3].
An MBG represents all the necessary and sufficient interactions with respect to the Markov boundary of a given variable; thus it has a fundamental role both in the prediction of a target variable and in the interpretation of the interactions and mechanisms for a target variable.
Throughout the chapter, we use the term feature in a broad sense to denote a function $F({V}^{\prime},G)=f$ with arguments $V^{\prime}$ (${V}^{\prime}\subseteq V$) and DAG G, e.g., MBG(Y,G)=mbg, MBS(Y,G)=mbs and MBM(Y,X_{i},G); that is, we concentrate only on the structural aspects of DAGs for a given subset of variables $V^{\prime}$. However, the concept of features over DAGs is also used to denote a quantitative property of a given DAG, such as maximal in-degrees, out-degrees, clique sizes, or lengths of directed paths. Additionally, the term “feature” is also used to denote variables, e.g., in the case of feature subset selection (FSS) problems.
The Markov blanketbased extensions of the relevance concepts express local aspects, i.e., neighboring variables. The advantages of BNs for representing global dependence maps and relevance relations are well known, but their direct application in FSS is hindered by their high computational and sample complexity.^{9} Now we consider the use of the global representational properties of BNs to characterize relevance relations, e.g., the type of association between nonneighboring, distant variables. The subtypes of relevance are listed in Table 13.1 along with the notation assigned to them.
Table 13.1 Graphical model-based definition of relevance types.

| Relation | Notation | Graphical definition |
| --- | --- | --- |
| Direct causal relevance | DCR(X,Y) | There is an edge between X and Y. |
| Indirect causal relevance | ICR(X,Y) | There is a directed path between X and Y. |
| Causal relevance | CR(X,Y) | DCR(X,Y) or ICR(X,Y). |
| Confounded relevance | ConfR(X,Y) | X and Y have a common ancestor. |
| (Pairwise) association | A(X,Y) | DCR(X,Y) or ICR(X,Y) or ConfR(X,Y). |
| Interactionist relevance | IR(X,Y) | X and Y have a common child. |
| Strong relevance | SR(X,Y) | IR(X,Y) or DCR(X,Y) (also MBM(X,Y)). |
| Mixed conditional relevance | MCondR(X,Y) | There is an undirected path between X and Y with a node with converging edges. |
| Conditional relevance | CondR(X,Y) | IR(X,Y) or MCondR(X,Y). |
| Weak relevance | WR(X,Y) | MCondR(X,Y) or ICR(X,Y) or ConfR(X,Y). |
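Several of the relations in Table 13.1 reduce to simple graph queries. The sketch below is illustrative: it stores a DAG as a dict mapping each node to the set of its children, reads ICR as "a directed path of any length," and uses a hypothetical confounded graph as the example.

```python
# Deciding some Table 13.1 relations on a DAG given as {node: children}.
def descendants(g, x):
    """All nodes reachable from x along directed edges (excluding x)."""
    seen, stack = set(), [x]
    while stack:
        for c in g.get(stack.pop(), set()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def dcr(g, x, y):    # edge between X and Y (either direction)
    return y in g.get(x, set()) or x in g.get(y, set())

def icr(g, x, y):    # directed path between X and Y
    return y in descendants(g, x) or x in descendants(g, y)

def confr(g, x, y):  # common ancestor of X and Y
    return any(x in descendants(g, z) and y in descendants(g, z) for z in g)

def ir(g, x, y):     # common child of X and Y
    return bool(g.get(x, set()) & g.get(y, set()))

def association(g, x, y):  # A(X, Y) = DCR or ICR or ConfR
    return dcr(g, x, y) or icr(g, x, y) or confr(g, x, y)

# Illustrative DAG: Z confounds X and Y, X also reaches Y through W,
# and X and Y share the child C (interactionist relevance).
G = {"Z": {"X", "Y"}, "X": {"W", "C"}, "W": {"Y"}, "Y": {"C"}}
```

Here X and Y are associated (indirectly and through the confounder Z) and in interactionist relevance, yet share no edge.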
For their causal interpretation under the causal Markov assumption, the reader is referred to [26, 42].^{10} It needs to be emphasized that these relations represent different aspects of relevance, and there are subtle differences with respect to their use in genetics because of the possibility of multiple target variables and of non-genetic predictors. Following the usual genetic terminology [15], direct causal relevance (DCR) formalizes the concept of direct association, although it covers direct consequences as well. Indirect causal relevance (ICR) and confounded relevance (ConfR) separate and express the concepts of indirect association and confounded association. Pairwise association (A) represents the usual association, which is the union of direct, indirect, and confounded associations. Interactionist relevance (IR) deviates from the purely epistatic relation [16], because the latter is the consequence of vanishing marginal effects of the individual variables, which can be modeled by contextual dependences (see Definition 13.5). To summarize, the most notable difference between the standard concept of pairwise association and strong relevance is that association includes certain forms of weak relevance (i.e., confounded relevance (ConfR) and indirect causal relevance (ICR)) and that it does not include the case represented by interactionist relevance. A direct consequence is that predictors exclusively in interactionist relevance will be filtered out by pairwise methods typical in high-dimensional studies, as they are not associated by definition. Also note that most of these relations are not mutually exclusive; e.g., a predictor may have direct and indirect effects at the same time, which can also be confounded. An overview of these relations is shown in Fig. 13.2. The applications of these relevance types are described in Subsection 13.3.6.
Hitherto, the definitions of relevance were based on the general concept of conditional independence, but conditional independence can be made more specific by introducing contextual independence, in which independence holds only in a given context (for its use in the field of BNs, the reader is referred to, e.g., [6]). In genetics, this representation can express a related, seemingly complementary phenomenon, in which a variation has no effect with respect to a given target except in the presence of other variations [14, 45].
Definition 13.5
A variable $X_i$ that is strongly relevant to Y is also contextually relevant to Y if there exist some values ${X}_{i}={x}_{{i}_{1}},{x}_{{i}_{2}}$ and a context ${s}_{i}={x}_{1},\dots ,{x}_{i-1},{x}_{i+1},\dots ,{x}_{n}$ for which $\mathbb{P}({x}_{{i}_{1}},{s}_{i})>0$ and $\mathbb{P}({x}_{{i}_{2}},{s}_{i})>0$, such that $\mathbb{P}(Y \mid {x}_{{i}_{1}},{s}_{i})=\mathbb{P}(Y \mid {x}_{{i}_{2}},{s}_{i})$.
Contextual relevance can be exactly and explicitly represented by special local dependence models in BNs, such as decision trees or decision graphs, which can express dependences at the value level (below the variable level) [6]. These models allow the explicit representation of the contexts, defined by value configurations $C=c$ ($C\subset V$), in which $X_i$ is conditionally independent of Y. Non-BN-based methods implicitly capable of addressing contextual relevance include multifactor dimensionality reduction (MDR) [39] and logic regression [47].
The identification of a contextually relevant factor $X_i$ is particularly challenging because, by definition, it has an effect only in particular contexts C=c. If its effect strength is moderate in these contexts, or if these contexts are very rare, then the reliable statistical identification of the relevance of $X_i$ requires prior knowledge and multivariate modeling. Furthermore, in the case of a purely epistatic factor, it is also possible that a strongly relevant predictor has no main effect [16]. A practical consequence is that the statistical identification of a contextually relevant factor is not possible without observing the other variables forming its context. In BNs lacking the ability to explicitly represent contextual independence, this phenomenon frequently leads to misinterpretations because of the incorrect representation of a contextually relevant factor $X_i$ by an interactionist relevance relation IR(X,Y) with target variable Y.^{11} Using omics data, such as genome-wide association study (GWAS) data or next-generation sequencing (NGS) data, the lack of observation of the other variables involved in the interaction is diminished. However, the problem of multiple hypothesis testing arises in the frequentist statistical framework because of the high number of potential interactions: the reader is directed to Section 13.4 for the discussion of multiple hypotheses in the Bayesian approach, and to [19, 35, 65] for non-BN-based Bayesian methods to explore interactions.
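The purely epistatic case can be made concrete with a toy XOR pattern, all names and probabilities being illustrative: with uniform inputs and Y = X1 XOR X2, neither variable shows a marginal effect, so pairwise association filtering would discard both, yet each fully determines Y once the other fixes the context.

```python
# Toy purely epistatic pattern: Y = X1 XOR X2 with uniform inputs.
from itertools import product

P = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

def prob(assign):
    """Probability of a partial assignment {variable_index: value}."""
    return sum(p for k, p in P.items()
               if all(k[i] == v for i, v in assign.items()))

# No marginal effect: P(Y=1 | X1=x1) = 0.5 for both values of X1.
marginal = [prob({0: x1, 2: 1}) / prob({0: x1}) for x1 in (0, 1)]

# Full effect in context: P(Y=1 | X1=1, X2=0) = 1.0.
contextual = prob({0: 1, 1: 0, 2: 1}) / prob({0: 1, 1: 0})
```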
13.2.4 Necessary Subsets and Sufficient Supersets in Strong Relevance
The MBM feature gives an overall characterization of strong relevance for each predictor separately, but does not capture the joint relevance of predictors. At the other extreme, Markov boundary subsets characterize the joint strong relevance of predictors, but the number of possible MB sets is exponential, which is intractable both computationally and statistically. The concept of k-ary Markov boundary subsets, focusing on k-sized sets of variables, was introduced to support a constrained multivariate analysis of strong relevance, called partial multivariate analysis of relevance [4]. Here we complement this sub-relevance concept with an analogous concept of sup-relevance.
Definition 13.6
For a distribution $\mathbb{P}(V)$ with Markov boundary set mbs, a set of variables s is called sub-relevant if it is a k-ary Markov boundary subset (k-subMBS), i.e., $|s|=k$ and $s\subseteq \mathrm{m}\mathrm{b}\mathrm{s}$. A set of variables s is called sup-relevant if it is a k-ary Markov boundary superset (k-supMBS), i.e., $|s|=k$ and $\mathrm{m}\mathrm{b}\mathrm{s}\subseteq s$.
The k-subMBS and k-supMBS concepts are more flexible than the MBS relation, because they can be used to capture the presence or the absence of relevant variables. A k-subMBS set ${s}_{\text{sub}}$ contains variables that are strongly relevant, without specifying the status of the other variables in its complement ${s}_{\text{sub}}^{c}$ (i.e., a k-subMBS denotes a “necessary” set of variables). The complement ${s}_{\text{sup}}^{c}$ of a k-supMBS set contains only variables that are not strongly relevant, but it does not state anything about the strong relevance of the variables in ${s}_{\text{sup}}$ (i.e., a k-supMBS denotes a “sufficient” set of variables). Note that the probabilistically defined concept of sub-relevance can be seen as a multivariate extension of the strong relevance of variables in Definition 13.1. Note also that the concept of sup-relevance is equivalent to the concept of Markov blankets, but for symmetry and uniformity we use the “sub” and “sup” terminology. We use the term k-MBS to denote the k-subMBS and k-supMBS concepts together.
Furthermore, the k-subMBS and k-supMBS concepts form hierarchically related, overlapping hypotheses; e.g., a k-subMBS hypothesis can be extended to multiple, more complex (k+1)-subMBS hypotheses. In fact, the scalable polynomial cardinality of the set of k-subMBSs and k-supMBSs bridges the linear cardinality of the features and the exponential cardinality of the MBSs ($O(n)<O(\binom{n}{k})<O({2}^{n})$), where n denotes the number of observed variables. Because the cardinality of the MBGs and DAGs is even higher [12], we can think of MBMs, k-subMBSs/k-supMBSs, MBSs, MBGs, and DAGs as more and more complex hypotheses about relevance. Indeed, they form a hierarchy of levels of abstraction to support a multilevel analysis, in which the intermediate k-MBS levels for varying k allow a scalable partial multivariate analysis focusing on k variables.
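The bridged hypothesis-space sizes can be made concrete with a quick, illustrative computation (the values of n and k and the SNP names below are arbitrary): n singleton MBM hypotheses, $\binom{n}{k}$ k-subMBS hypotheses, and $2^n$ candidate Markov boundary sets.

```python
# Hypothesis-space sizes bridged by the k-MBS levels, for illustrative
# n = 50 observed variables and k = 3.
from math import comb
from itertools import combinations, islice

n, k = 50, 3
sizes = (n, comb(n, k), 2 ** n)   # O(n) < O(C(n, k)) < O(2^n)

# The k-subMBS hypotheses can be enumerated lazily, e.g. the first three:
snps = [f"SNP{i}" for i in range(n)]
first = list(islice(combinations(snps, k), 3))
```

For n = 50 this gives 50 versus 19,600 versus roughly 1.1 × 10^15 hypotheses, which is the sense in which the k-MBS levels scale between the MBM and MBS extremes.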
An application of these concepts to the problem domain of asthma is illustrated in Fig. 13.3.
13.2.5 Relevance for Multiple Targets
If there are multiple target variables Y which have to be examined together, and the relations among the targets themselves are not of interest, one may ask for the variables relevant to the target set. The basic concepts of relevance for a single target variable can easily be extended to multiple targets [4].
Definition 13.7
A variable $X_i$ is strongly (weakly) relevant to Y if and only if it is strongly (weakly) relevant to at least one ${Y}_{i}\in Y$.
We introduce additional relations, such as exclusive or multiple relevance, in order to better characterize the relevance types of a predictor with respect to multiple target variables. The corresponding definitions are given in Table 13.2, using a Bayesian network representation. We refer the reader to Subsection 13.3.6 for the application of these relations in asthma and allergy research.
Table 13.2 Subtypes of multi-target relevance. The targets other than ${Y}^{\prime}$ are denoted ${Y}^{-}$ (${Y}^{-}=Y\setminus {Y}^{\prime}$).

| Direct relation(s) to | Notation | Graphical definition |
| --- | --- | --- |
| One or more targets | EdgeToAny(X,Y) | There is an edge between X and Y. |
| Exactly one of the targets | EdgeToExactlyOne(X,Y) | There is exactly one edge between X and Y. |
| Two or more targets | MultipleEdges(X,Y) | There is more than one edge between X and Y. |
| One or more other targets | $EdgeToOthers(X,{Y}^{-},{Y}^{\prime})$ | There is at least one edge between X and ${Y}^{-}$, but not between X and ${Y}^{\prime}$. |

| MBM relation(s) to | Notation | Graphical definition |
| --- | --- | --- |
| One or more targets | MBMToAny(X,Y) | There is a ${Y}_{i}\in Y$ with $MBM(X,{Y}_{i})$. |
| Exactly one of the targets | MBMToExactlyOne(X,Y) | There is exactly one ${Y}_{i}\in Y$ with $MBM(X,{Y}_{i})$. |
| Two or more targets | MultipleMBMs(X,Y) | There are at least two ${Y}_{i}\in Y$ with $MBM(X,{Y}_{i})$. |
| One or more other targets | $MBMWithOthers(X,{Y}^{-},{Y}^{\prime})$ | There is a ${Y}_{i}\in {Y}^{-}$ with $MBM(X,{Y}_{i})$ and $\neg MBM(X,{Y}^{\prime})$. |
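The MBM-based relations in Table 13.2 are simple counting predicates once a strong-relevance relation is available. The sketch below is illustrative: it takes the MBM relation as a caller-supplied predicate, and the toy predicate and target names (asthma, rhinitis) are hypothetical.

```python
# Table 13.2 MBM-based multi-target relations, parameterized by a
# strong-relevance predicate mbm(x, y) -> bool.
def mbm_to_any(mbm, x, targets):
    return any(mbm(x, y) for y in targets)

def mbm_to_exactly_one(mbm, x, targets):
    return sum(mbm(x, y) for y in targets) == 1

def multiple_mbms(mbm, x, targets):
    return sum(mbm(x, y) for y in targets) >= 2

def mbm_with_others(mbm, x, targets, y_prime):
    # MBM to some target in Y \ {Y'} but not to Y' itself.
    return (not mbm(x, y_prime) and
            any(mbm(x, y) for y in targets if y != y_prime))

# Illustrative strong-relevance relation over two phenotype targets.
targets = ["asthma", "rhinitis"]
toy_mbm = lambda x, y: (x, y) in {("SNP1", "asthma"),
                                  ("SNP1", "rhinitis"),
                                  ("SNP2", "rhinitis")}
```

Under the toy relation, SNP1 satisfies MultipleMBMs, while SNP2 satisfies MBMToExactlyOne and MBMWithOthers with respect to asthma.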
13.3 A Bayesian View of Relevance for Complex Phenotypes
The Bayesian network representation, along with the concepts of Markov blanket sets and strong relevance, opened many research directions in feature learning, in the feature subset selection problem, and in genetic association studies [48]. The “filter” approaches, later termed “local causal” approaches, emerged from Markov blanket sets (MBSs) and strong relevance [1, 10, 28, 34, 63].^{12} However, despite the rapid development of methods designed to identify an optimal MBS, the global significance of the optimal MBS in the frequentist framework, as well as the lack of dominant MBSs in the Bayesian framework, are still overlooked.
Bayesian methods are increasingly popular in genetic association studies for their ability to sufficiently characterize and explore weakly significant results and to cope with multiple hypothesis testing (for the general approach see [18, 24]; for the application to GAS see [52]; for methods see [19, 35, 65]).
The Bayesian approach aims at determining the posterior distribution of each of the relevance relations defined in Section 13.2 given some data, which is used to update the prior probability distribution, i.e., our knowledge on these relations prior to obtaining those data. (Henceforth we will use the shorthand “posterior” both for the posterior probability of the specific value of a feature and for the posterior probability distribution of a feature in general, if it is clear from the context.) These posterior distributions, such as those of Markov boundary graphs, Markov boundary sets, or Markov boundary memberships, are derived by the propagation of the DAG posterior to these “upper” levels by aggregation. Technically speaking, these posteriors are induced^{13} by the posterior $\mathbb{P}(G \mid {D}_{N})$ over the DAG space given the data set $D_N$, because MBG(Y,G), MBS(Y,G), and $MBM(Y,{X}_{i},G)$ are functions (transformations) interpreted over DAGs (i.e., they map each DAG G to a unique value). Consequently, a (normalized) probability distribution over DAGs will induce a (normalized) distribution over the MBGs, the MBSs, and the MBM for each $X_i$ as follows:

(13.2)$$\mathbb{P}(MBG(Y,G)=mbg \mid {D}_{N})=\sum_{G}1(MBG(Y,G)=mbg)\,\mathbb{P}(G \mid {D}_{N}),$$

(13.3)$$\mathbb{P}(MBS(Y,G)=mbs \mid {D}_{N})=\sum_{G}1(MBS(Y,G)=mbs)\,\mathbb{P}(G \mid {D}_{N}),$$

(13.4)$$\mathbb{P}(MBM(Y,{X}_{i},G) \mid {D}_{N})=\sum_{G}1(MBM(Y,{X}_{i},G))\,\mathbb{P}(G \mid {D}_{N}),$$

where 1(A) denotes the indicator function^{14} of event A.
Note that MBS(Y,G) and $MBM(Y,{X}_{i},G)$ depend only on MBG(Y,G); thus the functions $MBS(Y,G):G\to mbs$ and $MBM(Y,{X}_{i},G):G\to \{0,1\}$ can be analogously defined over the MBGs as $MBS(Y,MBG):mbg\to mbs$ and $MBM(Y,{X}_{i},MBG):mbg\to \{0,1\}$. Thus, the DAG posterior $\mathbb{P}(G \mid {D}_{N})$ is not necessary to define the MBS posterior $\mathbb{P}(MBS(Y,G) \mid {D}_{N})$ and the MBM posteriors $\mathbb{P}(MBM(Y,{X}_{i},G) \mid {D}_{N})$, because the MBG posterior $\mathbb{P}(MBG(Y,G) \mid {D}_{N})$ is sufficient. Analogously to equation (13.2), the MBG posterior induces the MBS posterior and the MBM posteriors, because of the functional dependence of the MBSs and of the MBM for each X_{i} on the MBGs.
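As a concrete illustration of this aggregation, the sketch below maps each DAG in a small hypothetical posterior (all structures and weights are invented for illustration) to its Markov blanket, and sums the DAG posteriors that land on the same MBS or MBM value:

```python
from collections import defaultdict

def markov_blanket(dag, y):
    """Markov blanket of y in a DAG given as {node: set of parents}:
    parents, children, and the children's other parents."""
    children = {v for v, ps in dag.items() if y in ps}
    spouses = set().union(*(dag[c] for c in children)) if children else set()
    return (set(dag[y]) | children | spouses) - {y}

def induced_posteriors(dag_posterior, y):
    """Aggregate a (normalized) posterior over DAGs into the induced MBS and
    MBM posteriors: each DAG's mass is added to the value it maps to."""
    mbs_post, mbm_post = defaultdict(float), defaultdict(float)
    for dag, p in dag_posterior:
        mbs = frozenset(markov_blanket(dag, y))
        mbs_post[mbs] += p       # contributes to P(MBS(Y,G) = mbs | D_N)
        for x in mbs:
            mbm_post[x] += p     # contributes to P(MBM(Y,X) = 1 | D_N)
    return dict(mbs_post), dict(mbm_post)

# Hypothetical posterior over three DAGs on {X1, X2, Y}
dag_posterior = [
    ({'Y': {'X1'}, 'X1': set(), 'X2': set()}, 0.5),        # MB(Y) = {X1}
    ({'Y': {'X1', 'X2'}, 'X1': set(), 'X2': set()}, 0.3),  # MB(Y) = {X1, X2}
    ({'Y': set(), 'X1': {'Y'}, 'X2': {'X1'}}, 0.2),        # MB(Y) = {X1}
]
mbs_post, mbm_post = induced_posteriors(dag_posterior, 'Y')
```

Note how the MBM posterior of X1 accumulates mass from every DAG in which X1 lies in the Markov blanket, regardless of edge directions.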
Consequently, we can formulate the central task of the Bayesian analysis of relevance using BNs as the estimation of the posterior distribution of MBGs. Note that the posteriors of k-subMBSs and k-supMBSs can also be induced by the MBG posterior. However, the MBG posterior is not sufficient to induce the posteriors of the relevance types in Tables 13.1 (page 324) and 13.2 (page 328); therefore, these have to be estimated individually. Thus, in addition to the estimation of the MBG posterior, the general task also includes the estimation of the expectation of arbitrary binary random (p.330) variables $1(F({V}^{\prime},G)=f)$ over the space of DAGs, such as the indicator functions corresponding to the relevance types in Tables 13.1 and 13.2.^{15} However, the exact computation of this expectation is generally not feasible, because of the superexponential number of possible DAG structures; thus an approximation is applied, using M DAGs G_{i} randomly sampled from the posterior $\mathbb{P}(G \mid {D}_{N})$:

$$\mathbb{E}\left[1(F({V}^{\prime},G)=f) \mid {D}_{N}\right]\approx \frac{1}{M}\sum _{i=1}^{M}1(F({V}^{\prime},{G}_{i})=f)\qquad (13.7)$$
There are several classes of methods designed to facilitate the estimation in equation (13.7). The Bayesian inference of structural properties of BNs was proposed in [7, 12]. In [37], Madigan et al. proposed a Markov chain Monte Carlo (MCMC) scheme to approximate such Bayesian inference, using an unnormalized posterior $\tilde{p}(G \mid {D}_{N})$ and an auxiliary ordering of the variables. In [22], Friedman et al. reported an MCMC scheme over the space of orderings. In [33], Koivisto et al. reported a method to perform exact full Bayesian inference over modular features.^{16} Recent applications in GASs are reported in [31, 61]. Another approach using PGMs applied the bootstrap framework to provide uncertainty measures for arbitrary subgraphs [20, 43]. To support Bayesian inference over relevance relations, we reported specialized ordering MCMC methods designed to efficiently estimate the posteriors of structural model properties, particularly of Markov blanket graphs [3, 4].
13.3.1 Estimating the Posteriors of Complex Features
As we discussed in the previous subsection, the posterior probability of any structural feature $F({V}^{\prime},G)=f$ can be estimated by MCMC methods if an efficient computation exists for an unnormalized posterior $\tilde{p}(G \mid {D}_{N})$ and for the feature value $F({V}^{\prime},G)=f$ (see equation (13.7)). However, the estimation of the posterior distributions of complex features with high cardinality, such as the MBS, MBG, and k-MBS features, poses an additional challenge. Earlier, we proposed a specialized ordering MCMC method to estimate MBG and MBS posteriors [3], but DAG MCMC methods are more suitable for general, potentially knowledge-intensive structural features; thus, we developed a two-step DAG-based process to estimate all the relations in the BNBMLA methodology.
We assume a complete data set D_{N}, discrete variables, multinomial sampling, Dirichlet parameter priors ($\mathbb{P}(\mathrm{\Theta} \mid G)$), and a uniform structure prior ($\mathbb{P}(G)$). We used two types of Bayesian Dirichlet parameter priors: the first is defined by the Cooper–Herskovits hyperparameters, and the second is defined by the Bayesian Dirichlet equivalent uniform (BDeu) hyperparameters with a virtual sample size equal to one (for their definition, see [12, 30]; for the effect of various virtual sample sizes, see [50, 57]). These assumptions allow the derivation of an efficiently computable unnormalized posterior over the Bayesian network structures G (i.e., over directed acyclic graphs, DAGs) [12, 30].
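A minimal sketch of such an efficiently computable score, assuming the Cooper–Herskovits hyperparameters (all Dirichlet hyperparameters equal to one); the toy data set and the two candidate structures are invented for illustration:

```python
import math
from collections import Counter

def ch_family_log_score(data, child, parents, arity):
    """Cooper-Herskovits log marginal likelihood of one family (a node and its
    parent set), with uniform Dirichlet hyperparameters alpha_ijk = 1."""
    r = arity[child]
    parent_cfgs = Counter(tuple(row[p] for p in parents) for row in data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    score = sum(math.lgamma(r) - math.lgamma(n_ij + r)
                for n_ij in parent_cfgs.values())
    score += sum(math.lgamma(n_ijk + 1) for n_ijk in joint.values())
    return score

def unnormalized_log_posterior(dag, data, arity):
    """log of the unnormalized posterior over structures under a uniform
    structure prior: the score decomposes into a sum of family scores."""
    return sum(ch_family_log_score(data, v, sorted(ps), arity)
               for v, ps in dag.items())

# Toy data set: X and Y perfectly correlated, so the edge X -> Y should win.
data = [{'X': 0, 'Y': 0}] * 4 + [{'X': 1, 'Y': 1}] * 4
arity = {'X': 2, 'Y': 2}
g_edge = {'X': set(), 'Y': {'X'}}
g_empty = {'X': set(), 'Y': set()}
```

The decomposability of the score into family terms is what makes the edge-wise MCMC moves of the next paragraph cheap: only the scores of the affected families need to be recomputed.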
Using this unnormalized posterior, we perform a special random walk in the space of DAGs G by applying an MCMC sampling method, which inserts, deletes, and inverts edges [25]. The probability of applying the different DAG operators in the proposal distribution is uniform, the length of the burn-in is 10^{6}, and the length of the sample collection (L) is $5\times {10}^{6}$.
(p.331) The MCMC process generates a dependent sequence of L DAGs ${D}_{L}^{G}$. Using the MCMC simulation, we estimate the posterior of the MBGs for the target variables according to equation (13.7) (see Section 13.3, page 328). In each MCMC step, we determine the boundary graph bd(Y,G) corresponding to the DAG G in this step and update the relative frequency of this boundary graph. (We recall that the presence of the boundary graph bd(Y,G) implies with probability one that the corresponding variables form the Markov boundary; see Subsection 13.2.2.) This update is done analogously for all the pairwise relations in Table 13.1 (see Subsection 13.2.3) and in Table 13.2 (see Subsection 13.2.5); these relations are evaluated for every possible pair of variables at each MCMC step, and hence a counter is maintained for each possible instantiation of the given relation throughout the MCMC sampling. Note that there are no practical restrictions on the selection of target variables, i.e., within the same MCMC simulation we can evaluate multiple target sets simultaneously. The computational complexity of the evaluation of the structural features discussed in this chapter and of the update of their relative frequencies is O(n^{2}) (n denotes the number of variables). In the case of a complex phenotype with multiple descriptors, we can use the descriptors together as a joint target set, as well as each of the descriptors individually. An occasionally practical inversion is to select a predictor variable as the target, because we can then explore all the phenotypes relevant for this predictor (for the application of this inversion in the frequentist framework, see [34]). Hence, the set of evaluated features can be fully fitted to the needs, queries, or preconceptions of the expert performing the analysis.
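The counting scheme described above can be sketched as follows, using the relative-frequency approximation of equation (13.7); the sample DAGs are hypothetical, and only a trivial feature (the parent set of the target) is shown, with the boundary-graph and pairwise relations handled analogously:

```python
from collections import Counter

def estimate_feature_posteriors(dag_samples, y, feature_fns):
    """Estimate the posteriors of structural features by their relative
    frequency in the MCMC sample: one counter is maintained per feature."""
    counts = {name: Counter() for name in feature_fns}
    for g in dag_samples:
        for name, fn in feature_fns.items():
            counts[name][fn(g, y)] += 1   # constant-time update per step
    n = len(dag_samples)
    return {name: {val: c / n for val, c in cnt.items()}
            for name, cnt in counts.items()}

# A hypothetical MCMC sample of DAGs, for illustration only
g1 = {'Y': frozenset({'X1'}), 'X1': frozenset(), 'X2': frozenset()}
g2 = {'Y': frozenset({'X1', 'X2'}), 'X1': frozenset(), 'X2': frozenset()}
samples = [g1, g1, g1, g2]
post = estimate_feature_posteriors(samples, 'Y',
                                   {'parents': lambda g, y: g[y]})
```

In a production setting the counters would be updated from a stream rather than a stored list, so the memory cost is independent of the chain length.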
In a second, “post-hoc”^{17} phase, we compute various MBS-related marginal posteriors from the MBG posterior estimated in the first phase. The MBS and MBM posteriors are computed exactly from the estimated MBG posterior, analogously to equation (13.5) (see Section 13.3, page 328). The posterior of a given k-subMBS set or k-supMBS set can be directly computed from the MBS posterior according to equations (13.9) and (13.10) (see Subsection 13.3.4). To find the highly probable k-subMBS and k-supMBS sets, we apply greedy algorithms, because the cardinalities of these sets grow polynomially ($O({n}^{k})$). In the case of k-subMBSs, the starting state of the greedy search is the empty set, which can be regarded as a trivial 0-subMBS with probability one. The algorithm then expands the set into the (k+1)-subMBS with maximal posterior. In the case of k-supMBSs, the initial state of the search is the complete set U, from which the algorithm iteratively eliminates predictors to get the (k−1)-supMBS with maximal posterior.
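A minimal sketch of the greedy expansion for k-subMBSs, assuming an already estimated MBS posterior; the posterior values below are invented for illustration:

```python
def sub_relevance(s, mbs_posterior):
    """Posterior that every variable in s is strongly relevant: the total
    posterior mass of the MBSs that contain s."""
    return sum(p for mbs, p in mbs_posterior.items() if s <= mbs)

def greedy_k_sub_mbs(mbs_posterior, k):
    """Greedily grow the empty set (the trivial 0-subMBS, posterior one) into
    k-subMBSs with locally maximal posterior; returns the search trace."""
    universe = set().union(*mbs_posterior)
    s, trace = frozenset(), [(frozenset(), 1.0)]
    for _ in range(k):
        s = max((s | {x} for x in universe - s),
                key=lambda cand: sub_relevance(cand, mbs_posterior))
        trace.append((s, sub_relevance(s, mbs_posterior)))
    return trace

# Hypothetical MBS posterior, for illustration only
mbs_post = {frozenset({'a', 'b'}): 0.4, frozenset({'a', 'c'}): 0.3,
            frozenset({'a'}): 0.2, frozenset({'b', 'c'}): 0.1}
trace = greedy_k_sub_mbs(mbs_post, 2)
```

The k-supMBS search runs the same loop in reverse, starting from the complete set and removing the variable whose elimination keeps the sup-relevance posterior maximal.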
In the MCMC simulation, we also calculate various quantitative measures of convergence and confidence of the posteriors of complex features. The following set of measures can be regarded as a standard set:
1. the Geweke Z-score, measuring convergence within one chain, i.e., the significance of the difference of the posteriors between the beginning and the end of the sampling [24];

2. the Gelman–Rubin R-score, measuring inter-chain convergence, i.e., the significance of the difference between independent sampling processes [24];

3. confidence intervals, based on the standard error of the MCMC.^{18}
(p.332) Fig. 13.4 demonstrates the convergence of the applied MCMC sampling with respect to the burn-in phase. Note that these measures are different for each feature; e.g., the estimation of the MBM posteriors is typically faster than the estimation of the 3-subMBS posteriors or of the MBG posterior. The values of the measures are calculated in each step l of the MCMC simulation using the MCMC samples of steps 1, …, l.
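A simplified version of the first measure can be sketched as follows; note that the full Geweke diagnostic replaces the plain variances with spectral-density estimates to account for autocorrelation within the chain, which is omitted here:

```python
import math

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke diagnostic: Z-score of the difference between the
    means of the early and late segments of a single chain."""
    n = len(chain)
    a, b = chain[:int(first * n)], chain[int((1 - last) * n):]
    def mean(xs):
        return sum(xs) / len(xs)
    def sem2(xs):   # squared standard error of the mean
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1) / len(xs)
    return (mean(a) - mean(b)) / math.sqrt(sem2(a) + sem2(b))

# For a chain of MBM indicator samples that has converged, the score is near 0.
chain = [i % 2 for i in range(1000)]   # perfectly stationary toy chain
z = geweke_z(chain)
```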
In the following, we demonstrate the application of this approach on a case study. The study involves 1201 unrelated individuals from the Hungarian (Caucasian) population. Four hundred and thirty-six asthmatic children, aged 3 to 18, were recruited for the study. The control group consisted of 765 subjects (mean age: 19 years, 405 males/360 females). We used three embedded data sets: (1) the asthma status was known in all cases (1201 subjects, data set A); (2) in 1100 cases, the rhinitis status was also known (data set RA) (only those subjects whose rhinitis status had been verified by specialists were included); and (3) in 200 cases, the rhinitis status and the serum levels of IgE and Eosinophil were also known (data set CLI).
13.3.2 Sufficiency of the Data for Full Multivariate Analysis
Regardless of whether a Bayesian conditional method (such as Bayesian logistic regression) or the BN-based approach is used, the posterior probability distribution over sets of predictors indicating their joint relevance with respect to the selected model class (such as the MBS posterior in the case of BNs) is typically flat for the settings of contemporary GASs, because of the relation between sample size, number of predictors, effect sizes, and model complexity, including the prior. Such flat MBS posteriors, ranked from the maximum a posteriori (MAP) MBS to the less probable ones, are illustrated in Fig. 13.5, which shows that there are several sets with only slightly lower probabilities than that of the MAP set. This also indicates that the MAP MBS is not dominant, as its posterior is negligible. Furthermore, the cumulative distribution functions in Fig. 13.5 also indicate the lack of dominant MBSs, i.e., the lack of a small number of MBSs with high posteriors such that their summed posterior is close to one. It needs to be emphasized that these results are consequences of the power of the data and not of the priors; thus, they also indicate the lack of a dominant optimal model in the frequentist, maximum likelihood approach. In the case of the data set RA and the Asthma target variable, the MAP set has a probability of only 0.010688. (p.333) Because of the smaller sample size, when Asthma and Rhinitis are the target variables (i.e., multi-target analysis), the probability of the MAP set is even lower, namely 0.007626. This phenomenon is even more pronounced in the case of data set CLI, where the corresponding probabilities of the MAP sets are 0.001496 (Asthma target) and 0.000073 (multi-target), respectively.
These flat MBS posteriors are consistent with our earlier findings from the simulations in [4], which suggest that a sample of size 200 in general induces a very flat posterior distribution (“small sample size”), whereas a sample of size 1000 corresponds to a “medium sample size” with respect to our settings with 100 variables, which is typical of candidate GASs and partial genome screening studies.
The relation between the posterior value p of the MBSs and the corresponding rank r is well described by a power law (cf. Zipf’s law [60]), meaning that the posterior p is proportional to $1/{r}^{\mathrm{\alpha}}$, where α > 0 (Fig. 13.5). The heavy-tailed distribution and the lack of a dominant set or of a small set of highly significant sets indicate the necessity of refined approaches to cope with weakly significant results; e.g., Fig. 13.5 shows that the cumulative posterior of the 10 000 most probable MBSs is below 0.22.
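The exponent α can be estimated by least squares in log-log space; a minimal sketch, checked on an exact synthetic power law:

```python
import math

def zipf_exponent(posteriors):
    """Least-squares estimate of alpha in p(r) ~ 1/r^alpha from ranked
    posterior values, fitted in log-log space."""
    ranked = sorted(posteriors, reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(p) for p in ranked]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope   # positive for a decaying power law

# Synthetic check: an exact power law with alpha = 1.5 is recovered.
alpha = zipf_exponent([1.0 / r ** 1.5 for r in range(1, 101)])
```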
13.3.3 Rate of Learning: Effect of Feature and Model Complexity
The MBM, k-subMBS/k-supMBS, MBS, and MBG features form a hierarchy of increasing cardinality ($\text{MBM}<k\text{-MBS}<\text{MBS}<\text{MBG}<\text{BN}$), which is related to the learnability of these levels (for the sample complexity of learning BNs, see [23]). We investigated the effect of this increasing cardinality and computed various performance measures to characterize the learning rate of the simplest and of the more complex structural features related to strong relevance, namely the MBM, MBS, and MBG features. To do this, we performed systematic studies varying the sample size and the model complexity, using an artificial reference model M_{0} and an artificial data set generated from this model.
The artificial reference model M_{0}, used in our experiments throughout this chapter, is the maximum a posteriori Bayesian network learned from data set A. In the reference model M_{0}, the Markov blanket of the Asthma target variable consisted of 15 variables. Subsequently, we (p.334) randomly generated an artificial data set from M_{0}, which contained 10 000 independent, identically distributed, random samples from the estimated distribution. From this data set, we created smaller ones by simply taking the first 50, 100, 250, 500, 1000, 2000, and 5000 samples, respectively. Additionally, we defined embedded sets of variables with sizes 20, 40, 60, 80, and 100 (${s}_{20},{s}_{40},{s}_{60},{s}_{80},{s}_{100}$) to emulate different model complexities.^{19} These sets always contained the 15 reference input variables and the target variable, which were supplemented with randomly drawn variables not in the reference MBS. Fig. 13.6 shows the effect strengths (odds ratios^{20}) of the predictors in the reference model M_{0}. Note that there are many weak associations in the reference model M_{0}, i.e., in the real-world data set.
The relative flatness of a posterior at a given level generally indicates uninformativeness, i.e., a high level of uncertainty, and thus the inapplicability of that level, such as the level of MBGs, of MBSs, or of k-subMBSs/k-supMBSs above a given k. The general uncertainty of a posterior distribution can be characterized by its entropy,^{21} which is high for a flat, nearly uniform, noninformative posterior distribution. Fig. 13.7 shows the entropy of the distribution of the MBM and MBS features for various model sizes and sample sizes.
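The entropy computation itself is elementary; a sketch contrasting a flat and a peaked posterior over four hypothetical MBSs:

```python
import math

def entropy_bits(posterior):
    """Shannon entropy (in bits) of a discrete posterior: high for a flat,
    uninformative distribution, low for a peaked one."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)

# Hypothetical posteriors over four MBSs, for illustration only
flat = {s: 0.25 for s in ('s1', 's2', 's3', 's4')}    # maximally uninformative
peaked = {'s1': 0.85, 's2': 0.05, 's3': 0.05, 's4': 0.05}
```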
(p.335) We now present results about learning rates using sensitivity (Fig. 13.8A), false discovery rate (FDR, Fig. 13.8B), misclassification rate (Fig. 13.9B), and the area under the receiver operating characteristic (ROC) curve (AUC; Fig. 13.9A).^{22} Results are calculated at an acceptance threshold of $0.5<\mathbb{P}(MBM(Y,{X}_{i},G) \mid {D}_{N})$, i.e., the probability above which we accept a factor X_{i} as relevant.
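Given MBM posteriors and a known reference MBS (as in the simulation setting above), sensitivity and FDR at this acceptance threshold can be computed as follows; the posterior values are invented for illustration:

```python
def mbm_decision_performance(mbm_posteriors, true_mbs, threshold=0.5):
    """Sensitivity and FDR of the decision rule: accept X_i as strongly
    relevant iff its MBM posterior exceeds the threshold."""
    accepted = {x for x, p in mbm_posteriors.items() if p > threshold}
    tp = len(accepted & true_mbs)
    sensitivity = tp / len(true_mbs)
    fdr = (len(accepted) - tp) / len(accepted) if accepted else 0.0
    return sensitivity, fdr

# Hypothetical posteriors scored against a known reference MBS {a, b, c}
posteriors = {'a': 0.9, 'b': 0.6, 'c': 0.4, 'd': 0.7}
sens, fdr = mbm_decision_performance(posteriors, true_mbs={'a', 'b', 'c'})
```

Sweeping the threshold from 0 to 1 and recording the corresponding true and false positive rates yields the ROC curve whose area is reported as the AUC.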
The trends indicated in Figs. 13.8 and 13.9 (page 336) are summarized in Table 13.3 (page 336). For each performance measure, one can select thresholds for acceptable and very good performance, which allow the definition of “small” and “large” sample sizes as the minimal numbers of samples needed to reach these thresholds. In our case, with the given number of variables and model complexity, these thresholds can be selected as follows: 0.6 and 0.9 for AUC, 0.1 and 0.5 for sensitivity, and 0.5 and 0.1 for the false discovery rate. The inverse problem of selecting an optimal decision threshold for a given sample size is discussed in Section 13.4. The relatively modest performance and the relatively high “small” and “large” sample sizes are consequences of the abundance of weak associations in the reference model M_{0}.
Table 13.3 The “small” and “large” sample sizes for various performance measures (the decision threshold for MBM posteriors is 0.5).

|  | Sensitivity threshold | N | FDR threshold | N | AUC threshold | N |
| --- | --- | --- | --- | --- | --- | --- |
| “small” sample size | 0.1 | 100 | 0.5 | 250 | 0.6 | 250 |
| “large” sample size | 0.5 | 2000 | 0.1 | 5000 | 0.9 | 5000 |

Note: AUC, area under the receiver operating characteristic curve; FDR, false discovery rate.
(p.336) These results are good examples of our experiences with other candidate GASs, which suggest that if the ratio of the number of samples to the number of variables is below ten, then the MBS posterior is very flat, but the MBM posteriors show peaks. These observations indicate the necessity of the intermediate levels of k-subMBSs and k-supMBSs.
13.3.4 Bayesian Network-based Bayesian Multilevel Analysis of Relevance
In Subsection 13.2.1 (page 320), we defined different relevance types that can be used to infer strongly relevant variables, either separately from each other (MBM) or jointly as a complete set (i.e., as an MBS). Moreover, in the case of the MBG-based relevance type, even the interactions among the strongly relevant variables can be investigated. In this section, we provide some characteristics of these relevance types and show how each of them can be used in a genetic association study to reason about the relevance of the predictors.
Two complementary approaches can be distinguished in relevance analysis, depending on their focus. The first approach provides an overall characterization of relevance relations using a limited number of features and feature values, such as pairwise edges and Markov blanket membership relations [20, 21]. The second approach is at the other extreme of feature learning, as it provides a detailed characterization of the dependence relations between the target variables and the (p.337) predictors. Here we find the identification of MBSs and MBGs, which is also related to the learning of arbitrary subgraphs with statistical significance [43]. To understand this whole spectrum, first recall that the MBM, k-MBS, MBS, and MBG features are of increasing complexity. Consequently, the MBG and MBS posterior distributions are often too “flat” (i.e., there are hundreds of MBS or MBG features whose summed probability is relatively high but whose individual posteriors are low) even when the MBM posterior distributions $\mathbb{P}(MBM(Y,{X}_{i},G) \mid {D}_{N})$ show peaks. To illustrate the relation between the two approaches, we report the approximation of the (set-based) MBS posterior $\mathbb{P}(mbs(Y,G) \mid {D}_{N})$ using the (pairwise) MBM posteriors as follows:

$$\mathbb{P}(MBS(Y,G)=s \mid {D}_{N})\approx \prod _{{X}_{i}\in s}\mathbb{P}(MBM(Y,{X}_{i},G) \mid {D}_{N})\prod _{{X}_{i}\notin s}\left(1-\mathbb{P}(MBM(Y,{X}_{i},G) \mid {D}_{N})\right)$$
Fig. 13.10 shows that the MBMbased approximation allows only a rough quantitative estimation, and the corresponding ranking differs significantly (the difference is particularly pronounced in most practical cases when the sample is relatively small).
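The MBM-based approximation amounts to treating the Markov boundary membership of each variable as independent of the others; a sketch with hypothetical MBM posteriors:

```python
def approx_mbs_posterior(s, mbm_posteriors):
    """MBM-based approximation of the MBS posterior: each variable is treated
    as entering or leaving the Markov boundary independently."""
    p = 1.0
    for x, q in mbm_posteriors.items():
        p *= q if x in s else 1.0 - q
    return p

# Hypothetical MBM posteriors, for illustration only
mbm = {'a': 0.9, 'b': 0.5, 'c': 0.2}
p_ab = approx_mbs_posterior({'a', 'b'}, mbm)   # 0.9 * 0.5 * (1 - 0.2)
```

Comparing these products with the exact set-based posteriors (as in Fig. 13.10) makes the error of the independence assumption directly visible.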
The Bayesian multilevel analysis of relevance introduced scalable intermediate levels to provide a comprehensive view over multiple levels. It was motivated by the observation that, typically, the most probable MBSs and MBGs share significant common patterns even when the MBG and MBS posterior distributions are flat. We introduced the concept of sub-relevance, denoted in this chapter as k-subMBS (see Definition 13.6 (page 326) and [4]), to characterize these common elements. Typically, these common variables are present in the MBSs with high posteriors, and they usually have larger effect sizes. The posterior probability of the sub-relevance of a subset s is:

$$\underset{\_}{p}(s \mid {D}_{N})=\mathbb{P}(MBS(Y,G)=s \mid {D}_{N})+\sum _{s\subset {s}^{\prime}}\mathbb{P}(MBS(Y,G)={s}^{\prime} \mid {D}_{N}),\qquad (13.9)$$

where the first term is the exact MBS posterior of s, and the terms behind the summation sign are the MBS posteriors of the proper supersets of s. The posterior $\underset{\_}{p}(s \mid {D}_{N})$ expresses the probability that s contains only strongly relevant variables. If s is a common subset of almost all (p.338) highly probable MBSs, then its sub-relevance posterior will be high. On the contrary, if s is not a subset of the most probable MBSs, then the posterior will be low.
We demonstrate the use of the k-subMBS concept in the Asthma domain. In this domain, the MBS posterior distributions were very flat, whereas the MBM posterior distributions were very peaked, which indicates that analysis at the intermediate levels of k-subMBSs can reveal significant results. Thus, we evaluated the partial multivariate results illustrated in Fig. 13.11. In the cases of k = 1,2,3,4, the high maximum a posteriori probability (which corresponds to a relatively peaked posterior distribution) indicates that the sample size is sufficient to infer that these variables are jointly strongly relevant. In contrast, for k > 4, the maximum a posteriori multivariate features are only weakly significant. These results are in line with the expectation that, with increasing feature cardinality, the posterior distribution progressively flattens out. The posteriors corresponding to the polynomially increasing cardinality of the k-subMBSs bridge the gap between the flat posterior of the MBSs and the MBM posteriors characterized by the presence of numerous peaks.
Analogously, we define the concept of k-supMBS to characterize the not strongly relevant variables, which are typically not present in the MBSs. The posterior probability of sup-relevance is computed analogously to that of sub-relevance (equation (13.9)):

$$\stackrel{\u02c9}{p}(s \mid {D}_{N})=\sum _{{s}^{\prime}\subseteq s}\mathbb{P}(MBS(Y,G)={s}^{\prime} \mid {D}_{N})\qquad (13.10)$$
The posterior $\stackrel{\u02c9}{p}(s \mid {D}_{N})$ expresses the joint probability that all strongly relevant variables are in the set s (${X}_{i}\in s$) (equation (13.11)). Equivalently, none of the ${X}_{i}\notin s$ are strongly relevant (equation (13.12)):

$$\stackrel{\u02c9}{p}(s \mid {D}_{N})=\mathbb{P}(MBS(Y,G)\subseteq s \mid {D}_{N})\qquad (13.11)$$
$$=\mathbb{P}(\forall {X}_{i}\notin s:MBM(Y,{X}_{i},G)=0 \mid {D}_{N})\qquad (13.12)$$
In summary, a k-subMBS with high posterior contains variables that are strongly relevant with high confidence (“necessary” variables), and the complement of a highly probable k-supMBS (p.339) contains variables that are not strongly relevant with high confidence (i.e., a k-supMBS contains a “sufficient” set of variables). The strong relevance of any individual ${X}_{i}\notin s$ can be excluded with this probability (although $1-\mathbb{P}(MBM(Y,{X}_{i},G))$ gives a sharper value, as it is the exact posterior probability of not being strongly relevant).
Note that, for a given precedence (ordering) of the variables ≺, the sup-relevance $\stackrel{\u02c9}{p}$ for k = 0,1, … increases monotonically from 0 to 1, and $\underset{\_}{p}$ decreases monotonically from 1 to 0, as we progressively extend the empty set to the complete set following that precedence (see Fig. 13.12).
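This monotonicity is easy to verify numerically; the sketch below evaluates both posteriors on the growing prefixes of an ordering, for a hypothetical MBS posterior:

```python
def sub_rel(s, mbs_post):
    """Posterior that s contains only strongly relevant variables."""
    return sum(p for m, p in mbs_post.items() if s <= m)

def sup_rel(s, mbs_post):
    """Posterior that s contains all strongly relevant variables."""
    return sum(p for m, p in mbs_post.items() if m <= s)

def relevance_profiles(order, mbs_post):
    """Evaluate both posteriors on the growing prefixes of an ordering:
    sup_rel climbs monotonically to 1 while sub_rel falls from 1."""
    prefixes = [frozenset(order[:k]) for k in range(len(order) + 1)]
    return ([sub_rel(s, mbs_post) for s in prefixes],
            [sup_rel(s, mbs_post) for s in prefixes])

# Hypothetical MBS posterior, for illustration only
mbs_post = {frozenset({'a', 'b'}): 0.4, frozenset({'a', 'c'}): 0.3,
            frozenset({'a'}): 0.2, frozenset({'b', 'c'}): 0.1}
subs, sups = relevance_profiles(['a', 'b', 'c'], mbs_post)
```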
13.3.5 Posteriors for Multiple Target Variables
The definition of strong relevance to multiple targets is a straightforward extension of strong relevance to a single target variable (see Subsection 13.2.5, Definition 13.7). However, the posterior for a given target set Y cannot in general be calculated from the posteriors corresponding to the members of a partitioning $Y=\bigcup _{i}{Y}_{i}$, although the posteriors corresponding to subsets of the target set can be used for an approximation. In the case of MBMs, the corresponding approximation is:

$$\mathbb{P}(MBM(Y,{X}_{j},G) \mid {D}_{N})\approx 1-\prod _{i}\left(1-\mathbb{P}(MBM({Y}_{i},{X}_{j},G) \mid {D}_{N})\right)\qquad (13.13)$$
We computed and compared the posteriors of each directly relevant variable X_{j} with respect to each target variable Y_{i} in order to decompose the relevance of the genetic factors to the various phenotypic target variables. These targets were IgE (level), Eosinophil (level), Rhinitis, and Asthma, which are believed to participate in a complex causal model with multiple paths (see Fig. 13.3, page 327). Since in this problem domain the target variables form a causal model with strong dependences, we can expect that the relevance of a factor X_{j} to a given target is not independent of its relevance to (p.340) other targets. Indeed, the multi-target approach is motivated by the finding that equation (13.13) provides a rather poor approximation (Fig. 13.13). Note that the value of the posteriors for multiple targets is higher than that for single targets, due to the possibility of relevance to any of the targets, and so a correction is necessary for quantitative comparison.
Using the rs17831682 SNP in the PTGDR gene as an example (see PTGDR(1) in Fig. 13.3), we demonstrate the main advantages of the multitarget approach, namely that it allows us to distinguish between subtypes of multitarget relevance, which were summarized previously in Table 13.2 (page 328). When multitarget relevance is ignored, the posterior of strong relevance (i.e., the MBM posterior) of rs17831682 to IgE (level), Eosinophil (level), Rhinitis, and Asthma is 0.58, 0.52, 0.53, and 0.53, respectively, which indicates a somewhat moderate relevance to each target. The posterior for being strongly relevant to at least one of them (MBMToAny relation in Table 13.2) is higher: 0.71 (the approximation according to equation (13.13) is 0.95). However, the posterior probability that rs17831682 is strongly relevant exclusively to IgE, Eosinophil, Rhinitis, or Asthma (MBMToExactlyOne relation) is only 0.06, 0.04, 0.05, and 0.05, respectively, which shows that this SNP is probably related to multiple targets. This hypothesis is also supported by the posterior that this SNP is strongly relevant to other targets but not to IgE, Eosinophil (level), Rhinitis, or Asthma (MBMWithOthers relation): 0.37, 0.42, 0.42, and 0.42, respectively. Finally, the posterior for rs17831682 being a relevant SNP for multiple phenotypic targets (MultipleMBMs relation) is high (0.51), suggesting that this SNP is strongly relevant to the set of targets and that the SNP plays roles in multiple mechanisms.
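The approximation in equation (13.13) can be reproduced with the single-target posteriors quoted above for rs17831682; as noted in the text, it markedly overestimates the exact multi-target posterior of 0.71:

```python
def mbm_to_any_approx(single_target_posteriors):
    """Approximation of the posterior of being strongly relevant to at least
    one target, from the single-target MBM posteriors: the per-target
    relevance events are treated as independent."""
    p_none = 1.0
    for p in single_target_posteriors:
        p_none *= 1.0 - p
    return 1.0 - p_none

# The rs17831682 single-target posteriors quoted in the text
approx = mbm_to_any_approx([0.58, 0.52, 0.53, 0.53])   # ~0.95, vs. exact 0.71
```

The gap between 0.95 and 0.71 quantifies how strongly the per-target relevance events are coupled through the causal model of the phenotypes.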
13.3.6 Subtypes of Strong and Weak Relevance
The distinction between different types of relevance is essential in revealing the possible causal and mechanistic path(s) that connect the relevant SNP to its target variable(s). Estimating the posterior probability of various relevance types enables us to decide whether a SNP is directly relevant or its association is mediated by other factors or both. We demonstrate the dissection of relevance types in a BNBMLA analysis conducted on data set RA, which contains two phenotypic variables: Asthma and Rhinitis. Using Asthma as the sole target, the following posteriors were (p.341) estimated for each SNP: direct causal relevance (DCR), association (A), strong relevance (SR), interactionist relevance (IR), and indirect causal relevance (ICR). The corresponding posteriors for a number of SNPs are shown in Table 13.4.
Table 13.4 Posteriors for various types of relevance based on data set RA using Asthma as the only target. Each line contains the posteriors for a SNP identified by the first column, e.g., PTGDR(2) refers to the second SNP measured within the PTGDR gene. Subsequent columns correspond to relevance types in the following order: association (A), direct causal relevance (DCR), strong relevance (SR), interactionist relevance (IR), indirect causal relevance (ICR).
| SNP | A | DCR | SR | IR | ICR |
| --- | --- | --- | --- | --- | --- |
| AHNAK(2) | 0.643 | 0.029 | 0.736 | 0.708 | 0.535 |
| TXNDC16(1) | 0.309 | 0.008 | 0.722 | 0.713 | 0.189 |
| PRPF19(1) | 0.878 | 0.718 | 0.843 | 0.125 | 0.822 |
| PTGDR(2) | 0.923 | 0.326 | 0.362 | 0.035 | 0.747 |
| PTGER2(2) | 0.970 | 0.350 | 0.354 | 0.004 | 0.604 |
| WDHD1(1) | 0.964 | 0.115 | 0.242 | 0.127 | 0.202 |
In our current example, the SNPs can be clustered into four groups, as shown in Fig. 13.14. Note that association, direct relevance, transitive relevance, and interactionist relevance are complex, potentially overlapping events (see Fig. 13.2, page 325). The SNPs AHNAK(2) and TXNDC16(1) both have a moderately high posterior for strong relevance (0.736 and 0.722) but a very low posterior for direct causal relevance (0.029 and 0.008). This means that the strong relevance of these SNPs to Asthma is not due to a direct causal relationship but to a purely interactionist relevance with Rhinitis. Furthermore, the fact that the posterior for a transitive relationship with Asthma is relatively low (0.535 and 0.189 for AHNAK(2) and TXNDC16(1), respectively) indicates that interactionist relevance (posteriors: 0.708 and 0.713) is the only relevance subtype underlying the association of these SNPs with Asthma. This means that these SNPs are relevant and associated only if the Rhinitis state is known.
In contrast, PRPF19(1) is associated with Asthma not only transitively (posterior: 0.822) but also through a direct causal relationship (posterior: 0.718), suggesting two distinct causal paths connecting PRPF19(1) to Asthma: one of the paths can be blocked by other factors, but the other one cannot. Its posterior for interactionist relevance is low (0.125), suggesting the lack of a third causal path.
In the third group of SNPs, PTGDR(2) and PTGER2(2) have a very high probability of association with Asthma (0.923 and 0.970, respectively), which is induced by a transitive relation, as indicated by the moderately high indirect causal relevance (ICR) posteriors (0.747 and 0.604). Note that all the other posteriors are relatively low, indicating that ICR is the only significant relevance type in this case.
WDHD1(1) contrasts with all the SNPs in the previous groups, because it has a high probability of association with Asthma (0.964) but none of its other posteriors is significant. This is possible in the case of a pure confounding relationship, in which a common cause influences both the SNP and the target (which are otherwise independent of each other). Note that if transitive (p.342) dependence and confounded dependence cannot be differentiated, such as in the case of linked SNPs, the transitively relevant and the confounded groups can be merged.
For a more detailed biomedical discussion of the application of this methodology in asthma and allergy, we refer the reader to [58].
13.3.7 Interaction-redundancy Scores Based on Posteriors of Strong Relevance
The MBM posterior probabilities $\mathbb{P}(MBM(Y,{X}_{i},G))$ characterize the strong relevance of the individual variables ${X}_{i}\in V$ (equation (13.2), page 329). However, $\mathbb{P}(MBM(Y,{X}_{i},G))$ inherently reflects multivariate aspects, as it is derived from a full Bayesian multivariate analysis through marginalization from the MBG posterior distribution (equation (13.6), page 329). This means that even a variable X_{i} that is marginally independent of the variable Y can have a high MBM posterior. The reader is referred to Subsection 13.2.3 (page 323) for a discussion of Definition 13.5 (page 325) regarding the relationships connecting interactionist relevance, contextual relevance, and the relevance of a variable with no or negligible main effect.
(p.343) However, the exploration of statistical interactions is an additional challenge, because the conceptualizations of statistical interactions are based on the joint, nonlinear effect of the interacting variables (for overviews, see e.g., [14, 45]; for methods, see e.g., [35, 40, 49, 65]). In the BNbased Bayesian framework, the joint relevance of the set of variables is quantified by its MBS posterior. Thus, we can define an interactionredundancy score based on the decomposability of the posterior of the strong relevance of a set of variables [4].
Definition 13.8
For a given data set D_{N} with sample size N, the features ${X}^{\prime}=\{{X}_{{i}_{1}},\dots ,{X}_{{i}_{k}}\}$ are structurally interacting (redundant) if the posterior $\mathbb{P}(k\text{-subMBS}({X}^{\prime},Y,G) \mid {D}_{N})$ is larger (smaller) than the MBM-posterior-based product $\prod _{j=1}^{k}\mathbb{P}(MBM({X}_{{i}_{j}},Y,G) \mid {D}_{N})$.
The interaction-redundancy is quantified by the following score
The interaction-redundancy score is illustrated in Fig. 13.15. This model-level approach to interaction and redundancy formalizes the intuition that relevant input variables with decomposable roles at the parametric level appear independently in the model. If the k-subMBS posterior of a set s is larger than its approximation based on the MBM posteriors according to equations (13.8) and (13.9), it may indicate that the variables in s have a joint parameterization expressing nonlinear joint effects. In contrast, in the case of a k-subMBS including redundant variables, the (p.344) posterior is smaller than its approximation based on the MBM posteriors, because the joint presence of the redundant variables in the model is suppressed.
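Since the exact score of [4] is not reproduced here, the sketch below uses one plausible form, the log ratio of the k-subMBS posterior to its MBM-product approximation, which is positive for interaction, negative for redundancy, and zero for decomposability; all posterior values are hypothetical:

```python
import math

def interaction_redundancy_score(joint_posterior, mbm_posteriors):
    """A plausible interaction-redundancy score (the exact score of [4] may
    differ): log ratio of the joint k-subMBS posterior to the product of the
    individual MBM posteriors."""
    product = 1.0
    for p in mbm_posteriors:
        product *= p
    return math.log(joint_posterior / product)

# Hypothetical posteriors, for illustration only
interacting = interaction_redundancy_score(0.60, [0.7, 0.6])  # joint > product
redundant = interaction_redundancy_score(0.20, [0.7, 0.6])    # joint < product
```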
Note that the interaction-redundancy scores corresponding to a given target do not appear to be related to the genetic linkage between the SNPs. Fig. 13.15 clearly shows that there are several intragenic, intrachromosomal, and interchromosomal interactions in this domain, exemplified by rs17197 and rs708502 in the PTGER2 gene (chromosome 14); rs12587410 in the PTGER2 gene and rs376966 in the DLG7 gene (both genes on chromosome 14); and rs11827029 in AHNAK (chromosome 11) and rs17831675 in the PTGDR gene (chromosome 14).
13.4 Bayes Optimal Decisions about Multivariate Relevance
The relatively high number of predictor variables in GASs poses a serious challenge because of the multiple hypothesis testing problem: in the univariate approach, the number of hypotheses is linear in the number of variables, while in a multivariate approach using a complex model class it can be exponential. Several approaches have emerged within the frequentist framework to handle the multiple hypothesis testing problem in both the univariate and the multivariate contexts. These include correction methods and permutation test-based approaches, and involve concepts such as the false discovery rate (FDR) and the q-value^{23} [54].
Because of its direct semantics, the Bayesian multivariate approach has a built-in automated correction for the multiple hypothesis testing problem: the posterior typically becomes flatter with an increasing number of variables and increasing model complexity, i.e., in a more complex hypothesis space.
Furthermore, the Bayesian decision theoretic framework permits optimal decisions about model properties, such as the optimal scientific reporting of results or the optimal continuation of the study (for Bayesian study design using BNs, see e.g., [2, 62]). First, we summarize the optimal decision problem for the relevance of variables based on their univariate posteriors and utilities. Second, we show the application of the Bayesian approach to construct a Bayesian FDR. Third, we consider the use of general informative loss functions.
13.4.1 Optimal Decision about Univariate Relevance
As we discussed in Subsection 13.3.4 (page 336), the univariate view of relevance using the MBM posteriors allows an overview, but it provides only a rough approximation of the MBS posteriors (see Fig. 13.10, page 337). We can construct a 2×2 cost matrix for the errors of a positive/negative decision about univariate strong relevance, when the real status of relevance is positive/negative: ${C}_{10}$ is the cost of a false positive decision, ${C}_{01}$ is the cost of a false negative decision, and for (p.345) simplicity we assume that the other costs are ${C}_{00}={C}_{11}=0$. The positive decision about the strong relevance of a given variable X_{i} with respect to the target variable Y is optimal if
$\mathbb{P}(MBM(X_i,Y)\mid D_N) > \tau = \frac{C_{10}}{C_{10}+C_{01}},$
where $\tau\in[0,1]$ is the optimal decision threshold. For example, if the cost ${C}_{10}$ of performing further measurements on a non-relevant variable (a false positive decision) equals the cost ${C}_{01}$ of missing a discovery (a false negative decision), then the optimal decision threshold τ is 0.5.
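The threshold rule above can be sketched as follows (function names are hypothetical; the rule declares a variable relevant when its MBM posterior exceeds τ = C10/(C10+C01)):

```python
def optimal_threshold(c_fp, c_fn):
    """Bayes-optimal threshold for a cost matrix with C00 = C11 = 0:
    a positive decision is optimal when the posterior of strong
    relevance exceeds tau = C10 / (C10 + C01)."""
    return c_fp / (c_fp + c_fn)

def decide_relevant(mbm_posterior, c_fp=1.0, c_fn=1.0):
    """Positive decision about univariate strong relevance (sketch)."""
    return mbm_posterior > optimal_threshold(c_fp, c_fn)

# Equal costs give tau = 0.5; a costly false positive raises the bar.
```

With c_fp three times c_fn, the threshold rises to 0.75, so a variable with MBM posterior 0.6 would no longer be reported.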
Fig. 13.16 (page 346) shows the ROC curves of the MBM-based decision about relevance using the reference model M_{0} from Subsection 13.3.3. The ROC curves show the sensitivity and specificity values for various sample sizes, considering reference variable sets ${s}_{20},{s}_{40},{s}_{60},{s}_{80}$, and s_{100} with varying numbers of variables.
13.4.2 Optimal Bayesian Decision to Control FDR
The measures of classification performance, such as sensitivity, FDR, and AUC, are valuable tools, but they need an external reference, a “gold standard,” which is typically available only in an evaluation context (for a recent comparison of measures, see e.g., [55]). The classical frequentist approach also assumes that there is an unknown reference set, the “true model” beneath our data. However, the Bayesian framework offers a natural solution to the lack of a reference model, based on Bayesian model averaging (BMA).
In the decision theoretic framework, sensitivity, specificity, FDR, negative predictive value (NPV), and other performance measures can be interpreted as special utility or cost functions based on the selected set $\hat{s}$ and a possible set s (for which a posterior exists). For example, FDR is the proportion of members of the selected set $\hat{s}$ not present in the possible set s, and sensitivity (i.e., the true positive rate, TPR) is the proportion of members of the possible set s present in the selected set $\hat{s}$:
$FDR(\hat{s}\mid s)=\frac{|\hat{s}\setminus s|}{|\hat{s}|},\qquad TPR(\hat{s}\mid s)=\frac{|\hat{s}\cap s|}{|s|}.$
In the Bayesian framework, the corresponding expected values (denoted with an upper bar) are well defined and can be approximated using MCMC simulation (see Subsection 13.3.1, page 330). For example, the expected value of the FDR corresponding to the decision to select $\hat{s}$ is as follows:
$\overline{FDR}(\hat{s})=\sum_{s}FDR(\hat{s}\mid s)\,\mathbb{P}(s\mid D_N). \quad (13.20)$
This calculation sums over possible scenarios with pairs $(\hat{s},s)$, where the set $\hat{s}$ is fixed and the possible set s varies. The probability of a scenario is defined by the posterior $\mathbb{P}(s\mid D_N)$ of the (p.346) corresponding possible set. Because the loss is defined by $FDR(\hat{s}\mid s)$, equation (13.20) defines the expected loss for a selected $\hat{s}$, allowing a search for an optimal set $\hat{s}^{\ast}$ minimizing the expected FDR loss.
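Assuming the MCMC sampler returns a list of sampled MBS sets, the expected FDR of a candidate selection can be approximated by a simple Monte Carlo average over those samples (a sketch; all names are hypothetical):

```python
def fdr(selected, possible):
    """FDR(s_hat | s): fraction of selected variables not in the possible set."""
    if not selected:
        return 0.0
    return len(selected - possible) / len(selected)

def expected_fdr(selected, mbs_samples):
    """Monte Carlo estimate of the expected FDR of equation (13.20):
    average FDR(s_hat | s) over the MBS sets s drawn by the MCMC
    sampler (uniform sample weights)."""
    selected = frozenset(selected)
    return sum(fdr(selected, frozenset(s)) for s in mbs_samples) / len(mbs_samples)

# Hypothetical sampled MBS sets:
mbs_samples = [{"X1", "X2"}, {"X1"}, {"X1", "X3"}]
```

Selecting {"X1"} here yields zero expected FDR, because X1 appears in every sampled set; minimizing this estimate over candidate sets approximates the search for $\hat{s}^{\ast}$.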
Fig. 13.17 shows the expected values of these measures using data set A and the Asthma target.
(p.347) The availability of the expected FDR value for any set allows the definition of a Bayesian discovery threshold called the d-value, a Bayesian analogue of the frequentist q-value^{24} [54].
Definition 13.9
The d-value for a given variable X_{i} is the minimal expected FDR value over multivariate selections of a relevant set s including X_{i}: $d\text{-value}(X_i)=\min_{s:X_i\in s}\overline{FDR}(s)$.
Note that the expected FDR value of a set $\hat{s}$ containing only one variable X_{i} is $1-\mathbb{P}(MBM(Y,X_i)\mid D_N)$, which is minimal for the variable with the maximal MBM posterior value ($X^{MBM}=\arg\max_{X_i}\mathbb{P}(MBM(Y,X_i)\mid D_N)$). However, the MBS posterior of this set $\hat{s}=\{X^{MBM}\}$ is typically very small, and sets with high MBS posteriors typically contain further variables with high MBM posteriors. The identification and construction of the set $\hat{s}$ with minimal expected FDR is supported by the observation that FDR can be decomposed as follows:
where $\hat{s}_i$ and $s_i$ denote the presence of the predictor X_{i} in the respective set (as always, the capital S_{i} denotes the corresponding random variable).
In fact, the expected value of any linear performance measure $U(\hat{s}\mid s)$ can be computed efficiently, because the following proposition holds:
Proposition 13.1
Given the linear utility function $U(\hat{s}\mid s)=\sum_{i}\lambda_i U_i(\hat{s}_i\mid s_i)$, the expected utility of a selected set $\hat{s}$ is $EU(\hat{s})=\sum_{i}\lambda_i EU(\hat{s}_i)$, where $\hat{s}_i$ and $s_i$ denote $X_i\in\hat{s}$ and $X_i\in s$ respectively, and $\lambda_i\in\mathbb{R}$.
Consequently, as the following theorem shows, the d-value(X_{i}) can be computed at the univariate level of the predictors using only their MBM posteriors $\mathbb{P}(MBM(Y,X_i)\mid D_N)$. Furthermore, the optimal set contains the variables with maximal univariate relevance.
Theorem 13.3
Because of the linearity of the FDR and Proposition 13.1, the d-value(X_{i}) is $\frac{1}{|s|}\sum_{X_j\in s}\bigl(1-\mathbb{P}(MBM(Y,X_j)\mid D_N)\bigr)$, where the optimal set s contains X_{i} and all X_{j} with higher posteriors (i.e., $s=\{X_i\}\cup\{X_j:\mathbb{P}(MBM(Y,X_i)\mid D_N)<\mathbb{P}(MBM(Y,X_j)\mid D_N)\}$).
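Theorem 13.3 suggests a direct computation of the d-values from the MBM posteriors alone; a minimal sketch, with hypothetical names and made-up posterior values:

```python
def d_values(mbm_posteriors):
    """d-values per Theorem 13.3: rank variables by MBM posterior and
    take the running mean of (1 - posterior) over the prefix ending at
    each variable (ties are grouped into the same prefix in this sketch)."""
    ranked = sorted(mbm_posteriors.items(), key=lambda kv: -kv[1])
    result, cumulative = {}, 0.0
    for k, (name, posterior) in enumerate(ranked, start=1):
        cumulative += 1.0 - posterior
        result[name] = cumulative / k
    return result

# Made-up MBM posteriors for three predictors:
dv = d_values({"X1": 0.9, "X2": 0.8, "X3": 0.4})
# d-value(X1) = 0.1, d-value(X2) = (0.1 + 0.2) / 2 = 0.15
```

The running-mean form makes the d-values monotonically non-decreasing down the ranking, mirroring the nested optimal sets of the theorem.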
Fig. 13.18 shows the expected FDR values for sets ${s}_{MBM,k}^{\ast}$ of increasing size, containing the k variables with the highest MBM posteriors. By Theorem 13.3, these are the d-values. Note that these sets are typically different from the most probable MBSs (sets s with high MBS posterior), because of the difference between the MBS posterior $\mathbb{P}(MBS(Y,G)=s\mid D_N)$ and its MBM posterior-based approximation in equation (13.8) (see Fig. 13.10, page 337). This difference arises from the potential dependences between the univariate relevances $MBM(X_i)$ and from the estimation errors of the finite MCMC sampling process.
13.4.3 General Bayes Optimal Decision about Multivariate Relevance
The most important factor in a general utility/loss function $U(\hat{s}\mid s)$ designed to select the set of variables $\hat{s}$ is the penalty for the difference between the sets $\hat{s}$ and s, using various loss functions, e.g., the “0–1” (L_{0}), linear (L_{1}), or quadratic (L_{2}) losses. The effects of these loss functions can be illustrated in the continuous domain, where the optimal values corresponding to these losses are (p.349) the mode (maximum), the median, and the mean, respectively. For example, a quadratic loss function with uniform weights and its corresponding expected value is defined as
$U_{L_2}(\hat{s}\mid s)=|\hat{s}\cap s|+|\hat{s}^{c}\cap s^{c}|-|\hat{s}^{c}\cap s|-|\hat{s}\cap s^{c}|, \quad (13.26)$
where $(\cdot)^{c}$ denotes the complement and the capital S denotes the corresponding random variable. Note that the terms in equation (13.26) are the true positive, true negative, false negative, and false positive counts, which directly correspond to the performance measures shown in Fig. 13.18 (see equation (13.6)). Fig. 13.19 (page 350) illustrates the nonmonotonic relation between the MBS posteriors and the expected utilities corresponding to equation (13.26).
The MBS posteriors (y-axis) are plotted against the ranks (x-axis) of the sets, where the ordering is defined by their expected L_{2} utilities (see equation (13.26)).
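Under the convention that the terms of equation (13.26) are the true positive, true negative, false negative, and false positive counts, the expected quadratic utility of a candidate set can likewise be approximated from MCMC samples of MBS sets. A sketch with hypothetical names:

```python
def l2_utility(selected, possible, variables):
    """TP + TN - FN - FP for a selected set against one possible set
    (the sign convention assumed for equation (13.26))."""
    tp = len(selected & possible)
    tn = len(variables - selected - possible)
    fn = len(possible - selected)
    fp = len(selected - possible)
    return tp + tn - fn - fp

def expected_l2_utility(selected, mbs_samples, variables):
    """Monte Carlo estimate of the expected quadratic utility over the
    sampled MBS sets (uniform sample weights)."""
    selected, variables = set(selected), set(variables)
    return sum(l2_utility(selected, set(s), variables)
               for s in mbs_samples) / len(mbs_samples)

variables = {"X1", "X2", "X3"}
mbs_samples = [{"X1", "X2"}, {"X1"}, {"X1", "X3"}]
```

Ranking candidate sets by this estimate reproduces the x-axis ordering used in Fig. 13.19.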
Other factors in a general utility/loss function $U(\hat{s}\mid s)$ can incorporate and score many aspects of the selected set $\hat{s}$. Such aspects are as follows:
1 Statistical measures: To characterize the statistical aspects of the selected set, we can combine factors such as FDR and NPV from equation (13.16) and equation (13.20).

2 Univariate importance: Depending on the domain, the cost of false positive and false negative errors can be defined separately for each variable X_{i}, i.e., using a linear utility function from Proposition 13.1 (page 348).

3 Structural aspects (interactions): The utility of the set $\hat{s}$ can also be defined based on structural aspects of the reference, using either the full network ($U(\hat{s}\mid G)$) or an MBG subgraph ($U(\hat{s}\mid mbg)$), (p.350) e.g., to express that the set $\hat{s}$ includes the factors of interactions represented by the graphical model.

4 Informative consistency^{25} and completeness^{26}: The internal consistency and completeness of the set $\hat{s}$ can be quantified based on domain-specific knowledge, thus providing a factor $U(\hat{s})$ that does not refer to the possible set. It can, for example, express that the predictors in the set $\hat{s}$ correspond to and completely cover one or more biological entities or mechanisms.
13.5 Knowledge Fusion: Relevance of Genes and Annotations
In genetic association studies, a key challenge is the incorporation of a wide range of information to transform and interpret the data at higher abstraction levels: at the level of haplotypes, genes, and proteins through “positional” or “functional” integration; at the level of pathways and cascades through “functional” integration; and through the “temporal” integration of earlier results to approximate an ideal meta-analysis. Whereas the fusion of knowledge at the level of genes, proteins, and pathways is an open research question, the data analysis workflow suggests the following scenarios. Fusion can be done either in the preprocessing phase, by creating new synthetic variables to be used in the data analysis, or in the interpretational (postprocessing) phase. Haplotype-level analysis is a well-known example of the first approach, because it is regularly done offline in a preprocessing phase or integrated into the data analysis itself (see e.g., HapScope [64], HAPLOT [27], GEVALT [17], and PHASE [53]). However, we can also construct synthetic variables $f_S(\underline{X})$ at the level of genes, proteins, and pathways to represent various functional effects, analogously to the preprocessing approach with haplotypes. If we use the synthetic variables (p.351) in the Bayesian relevance analysis, then we can estimate the posterior over their joint strong relevance, as well as the posteriors for all the partial relevances, e.g., the univariate relevances $\mathbb{P}(f_S\mid D_N)$.
In the postprocessing phase, a more common approach is to use synthetic variables $f_S(X)$ only to support interpretation. The mapping of genetic variants to genes, and then the use of the Gene Ontology (GO) above the gene level, is a practically important case. The most common approach in the frequentist framework is to combine the p-values of SNPs into an overall significance level to represent a gene, and to combine the p-values of genes into an overall significance level to investigate the association of a pathway with the disease (such combination methods include Fisher’s, Šidák’s, and Simes’s combination tests [44]). Note that these combination methods are vulnerable to correlations between the entities, and they cannot cope with complex interactions.
The Bayesian framework, and Bayesian networks in particular, offer a principled solution for the multivariate propagation of uncertainty from the level of the strong relevance of the predictors to the annotations of the predictors and to the taxonomies over the predictors, i.e., to induce a posterior over related terms from the MBS posterior, allowing uncertain relations in the annotations and taxonomies as well.
Let us assume a taxonomic tree $\mathcal{T}$, where each leaf node corresponds to an indicator variable representing the strong relevance of a given variable in the analysis, the internal nodes correspond to indicator variables of terms in the taxonomy, and the edges represent the taxonomical relations, directed from specific to general and forming a polytree, e.g., from SNPs to genes. For each term $A_{j,\mathcal{T}}$ in the set of terms $\mathcal{A}_{\mathcal{T}}$ present in the taxonomic tree $\mathcal{T}$, the tree defines a corresponding indicator function over the predictors $X_i\in X$, $A_{j,\mathcal{T}}(X_i):X\to\{0,1\}$ (if the context allows, we omit the index $\mathcal{T}$ and use a term and its indicator function interchangeably), where $A_{j,\mathcal{T}}(X_i)=1$ if and only if there is a directed path from $X_i$ to $A_{j,\mathcal{T}}$ in $\mathcal{T}$.
Based on these relations, a multivariate “semantic strong relevance” relation can also be defined, representing the semantically related annotations $s^{A}$ for a given set of predictors s, $SSR_{\mathcal{T}}(s):2^{X}\to 2^{A}$, as follows
$SSR_{\mathcal{T}}(s)=\{A_j\in\mathcal{A}_{\mathcal{T}} : \exists X_i\in s,\ A_{j,\mathcal{T}}(X_i)=1\}. \quad (13.29)$
We will generalize this logical taxonomy to a BN, but the current definition in equation (13.29) simply states that if the predictor X_{i} in set s is connected to the term A_{j} through a directed path, then A_{j} belongs to the set of “semantically related annotations” associated with s.
In the Bayesian statistical framework, the multivariate MBS posterior over sets of predictors ($\mathbb{P}(MBS(Y,G)=s\mid D_N)$) induces a multivariate posterior distribution over the sets of annotations $s^{A}$ as follows:
$\mathbb{P}(s^{A}\mid D_N)=\sum_{s:\,SSR_{\mathcal{T}}(s)=s^{A}}\mathbb{P}(MBS(Y,G)=s\mid D_N), \quad (13.30)$
where the posterior of a given annotation/term set $s^{A}$ is the sum of the posteriors of the sets of predictors s for which $SSR_{\mathcal{T}}(s)=s^{A}$.
(p.352) The posterior probability of an annotation A_{j} is as follows:
$\mathbb{P}(A_j\mid D_N)=\sum_{s:\,\exists X_i\in s,\ A_j(X_i)=1}\mathbb{P}(MBS(Y,G)=s\mid D_N). \quad (13.31)$
The interpretation of equation (13.31) can be illustrated by the following example. Let us assume that $A_j(\cdot)$ represents the correspondence of SNPs to a given gene g_{j}. Then equation (13.31) follows the intuition that the posterior relevance of the gene g_{j} is the posterior probability that at least one SNP in this gene is strongly relevant. The multivariate transformation defined in equation (13.31) typically differs from various ad hoc averaging- and maximization-based aggregations of the MBM posteriors corresponding to the leaves of the annotation term. This is a consequence of the fact that the multivariate approach exactly takes into account the dependences between the strong relevances of the predictors corresponding to the leaves.
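This gene-level intuition corresponds to a simple aggregation over the MCMC samples of MBS sets: a term counts as relevant in a sample if at least one of its annotated predictors is in the sampled set. A sketch, assuming a flat SNP-to-gene annotation map (all names hypothetical):

```python
from collections import defaultdict

def annotation_posteriors(mbs_samples, annotation):
    """Aggregate sampled MBS sets to annotation-level posteriors in the
    spirit of equation (13.31): the posterior of a term (e.g., a gene)
    is the fraction of samples containing at least one predictor
    annotated with that term."""
    counts = defaultdict(int)
    for sample in mbs_samples:
        for term in {annotation[x] for x in sample if x in annotation}:
            counts[term] += 1
    return {term: c / len(mbs_samples) for term, c in counts.items()}

# Hypothetical SNP-to-gene map and sampled MBS sets:
snp_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_B"}
mbs_samples = [{"rs1"}, {"rs2"}, {"rs3"}, {"rs1", "rs3"}]
posteriors = annotation_posteriors(mbs_samples, snp_to_gene)
# GENE_A appears in 3 of 4 samples, GENE_B in 2 of 4.
```

Note that this differs from averaging the two SNP-level MBM posteriors of GENE_A, because the samples capture the dependences between the SNPs' relevances.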
The structure of the taxonomy and prior domain knowledge can also be used to refine the semantic relevance relations. We can interpret the taxonomy $\mathcal{T}$ as a special BN, in which the local parametric models are logical OR relations. In this model, the posterior for the multivariate semantic strong relevance relation can be interpreted as the result of an inference process with hard evidence at the leaves, which correspond to the indicator variables representing the strong relevance of the analyzed variables. However, this BN representation, which mixes predictors (e.g., SNPs) and terms, allows the incorporation of more background knowledge, e.g., using Noisy-OR local parametric models, in which the effect of the true state of a given input is inhibited with a given “inhibition probability” [41]. The parameters in the Noisy-OR models can represent the in- and out-degrees in the taxonomy: e.g., if a given term is annotated by many genes, and therefore its in-degree is relatively high, then the parameters are set to smaller values to adequately model the generality of the term. Similarly, if a given gene is annotated by many terms, and therefore its out-degree is relatively high, then the parameters can be set to smaller values to model the higher frequency of the gene.
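The Noisy-OR local model mentioned above has a standard closed form: the term is “on” unless every active input is independently inhibited. A minimal sketch (function name and parameter values are illustrative):

```python
def noisy_or(inhibition_probs, input_states):
    """Noisy-OR local model [41]: each active input independently fails
    to activate the term with its inhibition probability q_i; the term
    is active unless every active input is inhibited. A smaller q_i
    models a stronger (more specific) link in the taxonomy."""
    all_inhibited = 1.0
    for q, x in zip(inhibition_probs, input_states):
        if x:
            all_inhibited *= q
    return 1.0 - all_inhibited

# Two strongly relevant SNPs annotating a term, inhibition 0.1 each:
p_term = noisy_or([0.1, 0.1], [1, 1])  # 1 - 0.1 * 0.1 = 0.99
```

Setting larger inhibition probabilities on the links of a highly annotated (general) term implements the degree-based weighting described in the text.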
In Fig. 13.20, we demonstrate the results of the aggregation from the level of SNPs to the level of Gene Ontology biological process terms. The posterior probabilities of the MBSs calculated from data set A were aggregated to the level of genes by considering the physical loci and the functional roles of the SNPs. Then, we aggregated these results to the level of GO terms, considering the annotations of the genes. The results can be visualized as a network whose nodes are the functional terms, with connections corresponding to the hierarchy of the ontology. The size of each node is proportional to the posterior probability that the functional term it represents plays a functional role in the biological phenomena under investigation.
13.6 Conclusion
The Bayesian approach provides a unified framework for study design, integrated exploratory data analysis, optimal decision making, and knowledge fusion in genetic association studies. Probabilistic graphical models, and particularly Bayesian networks, allow the decomposition and refinement of the overloaded concept of association. Bayesian networks in the Bayesian framework allow the inference of posteriors over multivariate strong relevance, interactions, global dependence, and causal relations, optionally with various specializations for multiple targets. Furthermore, the Bayesian network-based Bayesian multilevel analysis (BN-BMLA) of relevance in GASs allows (p.353) scalable intermediate levels between univariate strong relevance and full multivariate relevance, so that the results can be interpreted at partial multivariate levels; additionally, at each level, relevance can be analyzed from the dual points of view of necessity (k-subMBS) and sufficiency (k-supMBS).
The application of the Bayesian decision theoretic framework to the results of BN-BMLA from the data exploration phase opens up new possibilities for incorporating domain knowledge, supporting interpretation and potentially automating the discovery of interesting relations. The Bayesian framework also allows the principled and computationally efficient management of FDR and other performance measures.
The Bayesian statistical framework also offers a normative solution for the multiple hypothesis testing problem, which is caused by the high number of predictors and particularly by the number of interactions within the frequentist framework. This statement also remains true for the much richer hypothesis space of the new relevance relations defined in the language of BNs, such as the MBMs, k-subMBSs/k-supMBSs, MBSs, and MBGs. Within the Bayesian framework, the more-or-less flat posterior is the consequence of the high number of variables and the high number of models; it is an analogue of the loss of power in the frequentist framework caused by the corrections for the high number of variables and models. However, there is a fundamental difference between the two approaches, which is very valuable in biomedical applications: the Bayesian approach, specifically Bayesian model averaging,^{27} provides a normative method for the derivation of posteriors for complex hypotheses, such as k-subMBSs/k-supMBSs, MBSs, MBGs, or the semantic strong relevance. This is especially important in data and knowledge fusion, which is the main bottleneck in current biomedical/translational research.
(p.354) Nonetheless, the posterior estimates for BN features, such as MBMs, k-subMBSs/k-supMBSs, MBSs, and MBGs, still suffer from the multiple hypothesis testing problem, because the MCMC process itself, i.e., their estimation, is carried out in the frequentist framework. However, this problem is related mainly to the efficiency and length of the MCMC simulation, i.e., to the sampled DAGs in D_{L^G}, and not to the data set D_{N}. In other words, the Bayesian statistical framework transforms the statistically rooted multiple hypothesis testing problem into a computational task.
Fusion is an already well-recognized central challenge in genetic association studies. With the spread of next-generation sequencing technologies targeting rare variants, the importance of fusion will further increase. Genetic factors have a hierarchical taxonomy, starting from SNPs and moving up to genes, GO terms, and pathways. We can expect a similar hierarchical taxonomy to emerge over the phenotypic descriptors as well, such as the Human Phenotype Ontology [46]. Because genetic factors are typically predictors and phenotypic descriptors are typically targets in the BN-BMLA methodology,^{28} the methodology can be seen as supporting the analysis of relevance at multiple granularities and multiple abstraction levels.
The direct probabilistic semantics of the Bayesian statistical approach allows a mathematically direct and biomedically interpretable way to combine the results of data analysis with logical prior knowledge (for the aggregation of BN-BMLA results at the SNP level to the gene and pathway levels, see [36, 58, 59]). In addition to the propagation of the posteriors to upper levels by aggregation, it also allows the construction of Bayesian data analytic knowledge bases to support the fusion of weakly significant results from multiple data analyses.

Area under the receiver operating characteristic curve (AUC)
The AUC is a widely used measure of the ranking performance of a system with respect to a classification [29]. A plausible interpretation of the AUC is as follows: if we select a random positive sample and a random negative sample, the AUC is the probability that the method assigns a higher rank to the positive sample. An AUC value of 0.5 corresponds to completely random ranking; 1 corresponds to perfect prediction.

Bayesian model averaging (BMA)
The posterior distribution over Bayesian network structures given N observations, $\mathbb{P}(G\mid D_N)$, allows averaging both in prediction and in the exploration of model properties. In the latter case, we can calculate the posterior probability of any structural feature $F(V^{\prime},G)=f$ by summing the posterior probabilities of the models with property (feature) f (see equation (13.7) in Section 13.3, page 328). In fact, if the structural feature $F(V^{\prime},G)$ is interpreted as a discrete transformation (mapping) $t(x):X\to Y$, then this corresponds to inducing a distribution over the set Y from the distribution over X, i.e., transforming the posterior distribution over the model space $\mathbb{P}(G\mid D_N)$ into a posterior distribution $\mathbb{P}(F\mid D_N)$ over a simpler (less numerous) feature space (see equations (13.2) and (13.7) in Section 13.3, page 328).

Causal model
A model of a given set of random variables is said to be causal when an edge between two nodes is present if and only if the parent is a direct cause of the child (for an in-depth treatment see [42, 51]).

Contextual independence
Contextual independence is a specialized form of conditional independence, when the conditional independence of two variable sets A and B conditioned on a variable set Z is valid only for a certain value c (i.e., the context) of another disjoint set C.

d-separation
d-separation is a graphical test of independence between variables in a Bayesian network structure. Two sets of variables A and B are d-separated by a third set Z if and only if every path between A and B is blocked by Z. A path is blocked if it contains a node z with nonconverging edges (a chain or a fork) such that z is in Z, or a node with converging edges (a collider) such that neither it nor any of its descendants is in Z. In Bayesian networks, d-separation implies conditional independence.

Domain
In the context of Bayesian networks, the domain is the segment of the world that a Bayesian network tries to model. A Bayesian network is said to represent a given domain if it models all the necessary knowledge of the domain. Represented/modeled entities of the domain appear as nodes in the Bayesian network.

Feature subset selection
FSS is the task of selecting an optimal subset of features/properties from a larger set according to some external optimality criterion. A good example is selecting a subset of tests to perform so as to achieve an optimal cost–information gain balance. FSS is often viewed as a search problem in the space of feature subsets. To carry out this search, one must specify a starting point, a strategy to traverse the space of subsets, an evaluation function (induction algorithm), and a stopping criterion.

Filter approach
In the feature subset selection problem, the filter approach selects features using a preprocessing step that relies solely on properties of the data without taking into account the induction algorithm itself. This approach may lead to feature selections which in practice will perform poorly when evaluated by the induction algorithm. An example is to use the Pearson product–moment correlation coefficient to filter the predictors.

Independence map
A directed acyclic graph G is an independence map of a distribution if all the Markov assumptions (independence relations) implied by G are satisfied by the distribution.

Indicator function
An indicator function defined over a set X indicates membership of an element in a subset A of X, assigning the value 1 for all elements of A and the value 0 for all elements of X not in A.

Induction algorithm
In the feature subset selection problem, the induction algorithm is used for evaluating a subset of the features.

Loss function
In the context of decision theory, a loss function assigns a real number to the possible outcomes, reflecting one’s preferences over these outcomes. In contrast to a utility function, a greater value of the loss function represents a less preferred outcome.

L1 and L2 distances
L1 and L2 are standard distances between two vectors. The L^{p} distance of two vectors x and y is computed as ${(\sum_i |x_i-y_i|^p)}^{\frac{1}{p}}$; L1 is the Manhattan distance, and L2 is the Euclidean distance.

Markov compatibility
A joint probability distribution $\mathbb{P}$ and a DAG G (over the random variables in $\mathbb{P}$) are Markov compatible if each variable is conditionally independent of the set of all its nondescendants, given the set of all its parents.

Modular feature
A mapping f from graph structures onto $\{0,1\}$ is modular if $f(G)=\prod_{i=1}^{n}f_i(G_i)$, where each f_i is a mapping from the subsets of the complementary set $V\setminus\{X_i\}$ onto $\{0,1\}$, i.e., f is modular if the graph structures with $f=1$ can be defined as the free combinations of substructures with $f_i=1$ for $i=1,\dots,n$. For example, the indicator of a directed edge between two nodes is modular, because the graphs with this edge can be defined by two indicators for these two nodes selecting the legal parental sets. Furthermore, the indicator of any subgraph is also modular, because the parental sets covering the subgraph can be defined independently.

Pairwise association
Pairwise association is a form of statistical association between two random variables. In genetic association studies, the strength of association is measured between a single nucleotide polymorphism and trait status, based on samples that are informative for the trait of interest.

Observational equivalence class
Two directed acyclic graphs (DAGs) are observationally equivalent if they imply the same set of independence relations. Observational equivalence is a transitive relation; the set of DAGs observationally equivalent to each other is called an observational equivalence class.

q-value
The q-value is the false discovery rate (FDR) analogue of the p-value. The q-value of an individual hypothesis test is the minimum FDR at which the test may be called significant.

Sample complexity
Sample complexity is the minimum number of samples a learning algorithm needs to choose a hypothesis whose error is smaller than a given threshold ε with probability at least 1−δ.

Stable distribution
A distribution is said to be stable if there exists a directed acyclic graph that exactly represents the dependences and independences entailed by the distribution.

Strong relevance
A feature (attribute, variable) X_{i} is strongly relevant to Y if there exist some $X_i=x_i$, $Y=y$, and $s_i=x_1,\dots,x_{i-1},x_{i+1},\dots,x_n$ for which $\mathbb{P}(x_i,s_i)>0$ such that $\mathbb{P}(y\mid x_i,s_i)\ne\mathbb{P}(y\mid s_i)$. Strong relevance of a variable with respect to a given target variable implies that the variable is indispensable, in the sense that it is sometimes essential for the prediction of the target.

Utility function
In the context of decision theory, a utility function assigns a real number to the possible outcomes, reflecting one’s preferences over these outcomes. In the context of reporting a given set of strongly relevant variables, a utility function declares how good the given set is.

Weak relevance
A feature X_{i} is weakly relevant to Y if it is not strongly relevant, and there exists a subset of features $S_i^{\prime}$ of $S_i$ for which there exist some $x_i$, $y$, and $s_i^{\prime}$ with $\mathbb{P}(x_i,s_i^{\prime})>0$ such that $\mathbb{P}(y\mid x_i,s_i^{\prime})\ne\mathbb{P}(y\mid s_i^{\prime})$. Weak relevance of a variable with respect to a given target variable implies that the variable can sometimes improve the prediction accuracy of the target.

Wrapper approach
In the feature subset selection problem, the wrapper approach means that the selection algorithm searches for a good subset using the induction algorithm itself (as a black box) as part of the function evaluating feature subsets. The feature subset with the highest evaluation is chosen as the final set on which to run the induction algorithm. Examples include standard likelihoodbased forward selection and backward elimination methods in linear regression and logistic regression.
References
Bibliography references:
[1] C.F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. Koutsoukos. Local causal and Markov blanket induction for causal discovery and feature selection for classification. Journal of Machine Learning Research, 11:171–284, 2010.
[2] P. Antal, G. Hajós, and P. Sárközy. Bayesian network based analysis in sequential partial genome screening studies. In MODGRAPH 2009 Probabilistic Graphical Models for Integration of Complex Data and Discovery of Causal Models in Biology, Satellite Meeting of JOBIM 2009, 2009.
[3] P. Antal, G. Hullám, A. Gézsi, and A. Millinghoffer. Learning complex Bayesian network features for classification. In Probabilistic Graphical Models, pages 9–16, 2006.
[4] P. Antal, A. Millinghoffer, G. Hullám, C. Szalai, and A. Falus. A Bayesian view of challenges in feature selection: feature aggregation, multiple targets, redundancy and interaction. Journal of Machine Learning Research – Proceedings Track, 4:74–89, 2008.
[5] D.J. Balding, M. Bishop, and C. Cannings. Handbook of Statistical Genetics. Wiley & Sons, 2007.
[6] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In E. Horvitz and F. Verner Jensen, editors, Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI 96), pages 115–123. Morgan Kaufmann Publishers, 1996.
[7] W.L. Buntine. Theory refinement of Bayesian networks. In B. D’Ambrosio and P. Smets, editors, Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence (UAI 91), pages 52–60. Morgan Kaufmann Publishers, 1991.
[8] M. Woodward. Epidemiology: Study Design and Data Analysis. Chapman and Hall, 1999.
[9] M. Chen, Q. Shao, and J.G. Ibrahim. Monte Carlo Methods in Bayesian Computation. Springer-Verlag, 2000.
[10] G.F. Cooper, C.F. Aliferis, R. Ambrosino, J. Aronis, B. G. Buchanan, R. Caruana, M. J. Fine, C. Glymour, G. Gordon, B. H. Hanusa, J. E. Janosky, C. Meek, T. Mitchell, T. Richardson, and P. Spirtes. An evaluation of machine-learning methods for predicting pneumonia mortality. Artificial Intelligence in Medicine, 9:107–138, 1997.
[11] G. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 2:203–224, 1997.
[12] G.F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[13] G.F. Cooper and C. Yoo. Causal discovery from a mixture of experimental and observational data. In K. B. Laskey and H. Prade, editors, Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI 99), pages 116–125. Morgan Kaufmann Publishers, 1999.
[14] H.J. Cordell. Detecting gene-gene interactions that underlie human diseases. Nature Reviews Genetics, 10:392–404, 2009.
[15] H.J. Cordell and D.G. Clayton. Genetic association studies. Lancet, 366:1121–1131, 2005.
[16] R. Culverhouse, B.K. Suarez, J. Lin, and T. Reich. A perspective on epistasis: limits of models displaying no main effect. American Journal of Human Genetics, 70:461–471, 2002.
[17] O. Davidovich, G. Kimmel, and R. Shamir. GEVALT: an integrated software tool for genotype analysis. BMC Bioinformatics, 8:2105–2112, 2007.
[18] D.G.T. Denison, C.C. Holmes, B.K. Mallick, and A.F.M. Smith. Bayesian Methods for Nonlinear Classification and Regression. Wiley & Sons, 2002.
[19] B.I. Fridley. Bayesian variable and model selection methods for genetic association studies. Genetic Epidemiology, 33:27–37, 2009.
[20] N. Friedman, M. Goldszmidt, and A. Wyner. On the application of the bootstrap for computing confidence measures on features of induced Bayesian networks. In 7th International Workshop on Artificial Intelligence and Statistics, pages 197–202, 1999.
[21] N. Friedman and D. Koller. Being Bayesian about network structure. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI 2000), pages 201–211. Morgan Kaufmann Publishers, 2000.
[22] N. Friedman and D. Koller. Being Bayesian about network structure. Machine Learning, 50:95–125, 2003.
[23] N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In E. Horvitz and F. Verner Jensen, editors, Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI 96), pages 274–282. Morgan Kaufmann Publishers, 1996.
[24] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall, 1995.
[25] P. Giudici and R. Castelo. Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50:127–158, 2003.
[26] C. Glymour and G.F. Cooper. Computation, Causation, and Discovery. AAAI Press, 1999.
[27] S. Gu, A.J. Pakstis, and K.K. Kidd. Haplot: a graphical comparison of haplotype blocks, tagSNP sets and SNP variation for multiple populations. Bioinformatics, 21(20):3938–3939, 2005.
[28] B. Han, M. Park, and X. Chen. A Markov blanket-based method for detecting causal SNPs in GWAS. BMC Bioinformatics, 11:5, 2010.
[29] D.J. Hand. Construction and Assessment of Classification Rules. Wiley & Sons, 1997.
[30] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.
[31] X. Jiang, M.M. Barmada, and S. Visweswaran. Identifying genetic interaction in genome-wide data using Bayesian networks. Genetic Epidemiology, 34:575–581, 2010.
[32] R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial Intelligence, 97:273–324, 1997.
[33] M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.
[34] D. Koller and M. Sahami. Toward optimal feature selection. In Thirteenth International Conference on Machine Learning, pages 284–292. Morgan Kaufmann Publishers, 1996.
[35] C. Kooperberg and I. Ruczinski. Identifying interacting SNPs using Monte Carlo logic regression. Genetic Epidemiology, 28:157–170, 2005.
[36] O. Lautner-Csorba et al. Candidate gene association study in pediatric acute lymphoblastic leukemia evaluated by Bayesian network based Bayesian multilevel analysis of relevance. BMC Medical Genomics, 5:42, 2012.
[37] D. Madigan, S.A. Andersson, M. Perlman, and C.T. Volinsky. Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Communications in Statistics: Theory and Methods, 25:2493–2520, 1996.
[38] C. Meek. Causal inference and causal explanation with background knowledge. In P. Besnard and S. Hanks, editors, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI 95), pages 403–410. Morgan Kaufmann Publishers, 1995.
[39] J.H. Moore, J.C. Gilbert, C.T. Tsai, F.T. Chiang, T. Holden, N. Barney, and B.C. White. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. Journal of Theoretical Biology, 241:252–261, 2006.
[40] M.Y. Park and T. Hastie. Penalized logistic regression for detecting gene interactions. Biostatistics, 9:30–50, 2007.
[41] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[42] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[43] D. Pe’er, A. Regev, G. Elidan, and N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17:215–224, 2001.
[44] G. Peng, L. Luo, H. Siu, Y. Zhu, P. Hu, S. Hong, J. Zhao, X. Zhou, J.D. Reveille, L. Jin, C.I. Amos, and M. Xiong. Gene and pathway-based second-wave analysis of genome-wide association studies. European Journal of Human Genetics, 18:111–117, 2010.
[45] P.C. Phillips. Epistasis: the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics, 9:855–867, 2008.
[46] P.N. Robinson and S. Mundlos. The human phenotype ontology. Clinical Genetics, 77:525–534, 2010.
[47] I. Ruczinski, C. Kooperberg, and M. LeBlanc. Logic regression. Journal of Computational and Graphical Statistics, 12:475–511, 2003.
[48] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23:2507–2517, 2007.
[49] H. Schwender and K. Ickstadt. Identification of SNP interactions using logic regression. Biostatistics, 9:187–198, 2008.
[50] T. Silander, P. Kontkanen, and P. Myllymäki. On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In R. Parr and L. C. van der Gaag, editors, Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI 07), pages 360–367. AUAI Press, 2007.
[51] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2001.
[52] M. Stephens and D.J. Balding. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics, 10(10):681–690, 2009.
[53] M. Stephens and P. Donnelly. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. The American Journal of Human Genetics, 73(5):1162–1169, 2003.
[54] J.D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16):9440–9445, 2003.
[55] S.J. Swamidass, C. Azencott, K. Daily, and P. Baldi. A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval. Bioinformatics, 26:1348–1356, 2010.
[56] I. Tsamardinos and C. Aliferis. Towards principled feature selection: relevancy, filters, and wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Morgan Kaufmann Publishers, 2003.
[57] M. Ueno. Learning networks determined by the ratio of prior and data. In P. Grünwald and P. Spirtes, editors, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 10), pages 598–605. AUAI Press, 2010.
[58] I. Ungvári, G. Hullám, P. Antal, P.S. Kiszel, A. Gézsi, É. Hadadi, V. Virág, G. Hajós, A. Millinghoffer, A. Nagy, A. Kiss, Á.F. Semsei, G. Temesi, B. Melegh, P. Kisfali, M. Széll, A. Bikov, G. Gálffy, L. Tamási, A. Falus, and C. Szalai. Evaluation of a partial genome screening of two asthma susceptibility regions using Bayesian network based Bayesian multilevel analysis of relevance. PLOS ONE, 7:e33573, 2012.
[59] G. Varga, A. Szekely, P. Antal, P. Sárközy, Z. Nemoda, Z. Demetrovics, and M. SasvariSzekely. Additive effects of serotonergic and dopaminergic polymorphisms on trait impulsivity. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 159:281–288, 2012.
[60] R.E. Wyllys. Empirical and theoretical bases of Zipf’s law. Library Trends, 30:53–64, 1981.
[61] H. Xing, P.D. McDonagh, J. Bienkowska, T. Cashorali, K. Runge, R.E. Miller, D. DeCaprio, B. Church, R. Roubenoff, I.G. Khalil, and J. Carulli. Causal modeling using network ensemble simulations of genetic and gene expression data predicts genes involved in rheumatoid arthritis. PLOS Computational Biology, 7(3):e1001105, 2011.
[62] C. Yoo and G. Cooper. An evaluation of a system that recommends microarray experiments to perform to discover gene-regulation pathways. Artificial Intelligence in Medicine, 31:169–182, 2004.
[63] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205–1224, 2004.
[64] J. Zhang, W.L. Rowe, J.P. Struewing, and K.H. Buetow. HapScope: a software system for automated and visual analysis of functionally annotated haplotypes. Nucleic Acids Research, 30(23):5213–5221, 2002.
[65] Y. Zhang and J.S. Liu. Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39:1167–1173, 2007.
Notes:
(^{2}) For a brief definition of loss functions in the feature subset selection context, see Appendix 13.A (page 354).
(^{6}) $\mathbb{P}$ is Markov compatible with respect to G if $\mathbb{P}$ can be factorized by G [42].
(^{7}) For a brief definition of observational equivalence classes of BNs, see Appendix 13.A (page 354).
(^{10}) Note that the present definition of confounded relevance covers only a partial aspect of confounding [8].
(^{11}) The interactionist relevance (IR) representation of a contextually relevant factor X_{i} for target Y using variable C as a common child is incorrect, because it implies that the effect of variable X_{i} on Y related to variable C vanishes when C is not observed. This situation typically arises when the interacting variables X_{i} and C are dependent and X_{i} has a negligible main effect on target Y. However, the IR representation is correct in the special case of a purely epistatic factor, which has no main effect [16].
(^{12}) In a similar vein, the local causal discovery methods apply well-selected local tests instead of using global scores or interventional data, e.g., see [11, 13].
(^{17}) The goal of a post hoc analysis is to search the primary, “raw” results (in our case the MBS posteriors) for patterns that were not expected a priori or that are too numerous to enumerate in advance. In our case, such patterns are the k-subMBS and k-supMBS posteriors, which would be redundant and intractable if sampled directly during the MCMC run.
(^{18}) Confidence intervals for MCMC estimates can be calculated, e.g., using the batch means method [9].
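A minimal sketch of the batch means estimate mentioned in this note (the batch count, normality approximation, and helper name are our choices, not from [9]):

```python
import math

def batch_means_ci(samples, n_batches=20, z=1.96):
    """Approximate 95% CI for an MCMC estimate via the batch means method:
    split the chain into batches and estimate the Monte Carlo standard error
    from the spread of the batch averages (sketch; assumes batches long
    enough to be roughly independent)."""
    m = len(samples) // n_batches
    means = [sum(samples[i * m:(i + 1) * m]) / m for i in range(n_batches)]
    grand = sum(means) / n_batches
    var = sum((b - grand) ** 2 for b in means) / (n_batches - 1)
    se = math.sqrt(var / n_batches)
    return grand - z * se, grand + z * se
```

Applied to the chain of 0/1 indicator samples of, e.g., an MBM membership, this yields an interval around the estimated posterior probability.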
(^{19}) Model complexity also depends on the maximum number of parents and on the local models used in the BN (a local model defines the conditional distribution of a variable given its parents).
(^{20}) Odds ratios are standard measures of the effect size of a binary predictor X on a binary target Y. The odds ratio is defined as the ratio of the odds of Y when X=1 to the odds of Y when X=0: $\left(\mathbb{P}(Y=1\mid X=1)/\mathbb{P}(Y=0\mid X=1)\right)/\left(\mathbb{P}(Y=1\mid X=0)/\mathbb{P}(Y=0\mid X=0)\right)$.
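From a 2x2 table of case-control counts, the definition above reduces to the familiar cross-product ratio; a short sketch, with the standard Wald interval on the log odds ratio added as a common companion (the function names are ours):

```python
import math

def odds_ratio(a, b, c, d):
    """OR from a 2x2 table of counts:
    a = #(X=1, Y=1), b = #(X=1, Y=0), c = #(X=0, Y=1), d = #(X=0, Y=0);
    OR = (a/b) / (c/d) = a*d / (b*c)."""
    return (a * d) / (b * c)

def log_or_wald_ci(a, b, c, d, z=1.96):
    """Standard Wald 95% interval on the odds ratio (assumes all cells > 0)."""
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio(a, b, c, d))
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

odds_ratio(90, 10, 10, 90)  # = 81.0
```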
(^{21}) The entropy H(X) of a discrete random variable X with probability mass function P(X) is the expected value of the negative logarithm of P(X), that is: $H(X)=E[-\ln P(X)]$. An important consequence of this definition is that a uniformly distributed random variable (i.e., one whose distribution is completely “flat”) has maximal entropy. Note that the entropy depends on the cardinality, e.g., in the case of uniform distributions, entropy is higher for the MBG than for the MBS level. Also note that the MBM level consists of separate entropies for each $MBM(Y,{X}_{i},G)$, which are summed together to provide an aggregate measure (this summation corresponds to the simplifying assumption that $MBM(Y,{X}_{i},G)$ and $MBM(Y,{X}_{j},G)$ are mutually independent).
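The definition can be computed directly for a posterior given as a probability mass function (a sketch; the dictionary representation is our choice):

```python
import math

def entropy(pmf):
    """H(X) = E[-ln P(X)] for a discrete variable with pmf {value: probability};
    zero-probability values contribute nothing by convention."""
    return sum(-p * math.log(p) for p in pmf.values() if p > 0)

# A uniform ("flat") distribution over k outcomes attains the maximum ln(k),
# which grows with the cardinality k, as the note on MBG vs. MBS observes.
entropy({0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25})  # = ln 4
```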
(^{25}) Consistency expresses the lack of contradiction, e.g., that the set does not contain variables from competing, alternative theories.
(^{26}) Completeness expresses the lack of omission, e.g., that the set contains all the variables corresponding to a given theory.
(^{28}) Note that the BN-BMLA methodology is neutral with respect to the selection of target variables and that the MBM relation is symmetric. Consequently, a genetic variable or a small group of genetic variables can also serve as the target, to explore their phenotypic relevance when rich phenotypic descriptors are available.