Bayesian Networks in the Study of Genomewide DNA Methylation
Bayesian Networks in the Study of Genomewide DNA Methylation
Abstract and Keywords
This chapter explores the use of Bayesian networks in the study of genomescale deoxyribonucleic acid (DNA) methylation. It begins by describing different experimental methods for the genomescale annotation of DNA methylation. The Methylseq protocol is detailed and the biases induced by this technique are depicted, which constitute as many challenges for further analysis. These challenges are addressed introducing a Bayesian network framework for the analysis of Methylseq data. This previous model is extended to incorporate more information from the genomic sequence. Genomic structure is used as a prior on methylation status. A recurring theme is the interplay between the model used to glean information from the technology, and the view of methylation that drives the model specification. Finally, a study is described, in which such models were used, leading to both interesting biological conclusions and to insights about the nature of methylation.
Keywords: DNA methylation, Methylseq, Bayesian networks
Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics. First Edition. Christine Sinoquet & Raphaël Mourad (Eds).© Oxford University Press 2014. Published in 2014 by Oxford University Press.
This chapter explores the use of Bayesian networks in the study of genomescale DNA methylation. It begins by describing different experimental methods for the genomescale annotation of DNA methylation. The methylSeq method is detailed, and the biases induced by this technique, which constitute as many challenges for further analysis are depicted. These challenges are addressed introducing a Bayesian network framework for the analysis of methylSeq data. This previous model is extended to incorporate more information from the genomic sequence. Genomic structure is used as a prior on methylation status: unmethylated sites tend to cluster, and unmethylated sites are more conserved than other sites. A recurring theme is the interplay between the model used to glean information from the technology, and the view of methylation that drives the model specification. Finally, a study is described in which such models were used, leading to both interesting biological conclusions and to insights about the nature of methylation.
(p.364) 14.1 Introduction to Epigenetics
Epigenetic mechanisms influence phenotype through heritable, but potentially reversible, regulation of gene expression. These mechanisms work at many levels including DNA methylation, histone modifications, nucleosome positioning, and replication timing [9]. All of these have been shown to be heritable across cell divisions and to differ across different tissues and cell types [14, 18, 19, 21, 43, 48]. There is a significant body of ongoing research on each of these different mechanisms, and in this section we summarize two of the most studied epigenetic features: DNA methylation and histone modifications.
DNA methylation DNA methylation relates to the covalent attachment of a methyl group to a cytosine (C) nucleotide, in place of the hydrogen atom on the 5 position on the pyrimidine ring, or to the addition of a methyl group to an adenine (A). Adenine methylation has been found only in bacterial genomes [18], and the term “DNA methylation” usually, and throughout this chapter, refers to cytosine methylation. Major events in the cell, such as cell differentiation, Xchromosome inactivation and retrotransposon silencing to name a few, are characterized by DNA methylation, and in mice it has been shown that inhibition of the methylation regulating enzymes results in embryonic lethality [33, 55]. Broadly speaking, DNA methylation is associated with a heterochromatin state and silencing of transcription. In vertebrates, DNA methylation is restricted mostly to CpG sites (Cs that are followed by guanines (Gs)). Most of the vertebrate genome is methylated, and the unmethylated sites tend to cluster together along the genome [4, 55]. Many such unmethylated clusters occur at promoter regions, and their methylation is associated with the silencing of the nearby gene [60].
Histone modification The Nterminal tails of histones are subject to various types of posttranslational covalent modifications, including lysine and arginine methylation, lysine acetylation, ubiquitination, and serine phosphorylation [30]. A histone may harbor a number of different modifications at a given time, giving rise to many possible configurations of modifications, sometimes related to as a histone code [52, 57]. Histone modifications can influence the packing assembly of the DNA by moderating a histone’s DNAbinding affinity and by recruiting further chromatin remodeling complexes [30]. By doing so, these modifications can affect gene expression; some modifications are associated with transcriptional repression, and others are associated with transcriptional activation. Both types of modification can be present at either promoter or intragenic regions [8].
Epigenetic phenomena can affect gene regulation and be inherited between cell divisions through a mechanism different than that by which they were initiated. This results in an ability to maintain a regulatory program across cell divisions, enabling a form of cell “memory.” Epigenetics is therefore at the forefront of cell differentiation studies [10, 27, 39], and numerous epigenetic mutations have been associated with various diseases [46]. Specifically, many studies have reported association between altered methylation states and various cancers [3, 13, 28].
Another active field of research concerns deciphering the affects of environmental factors on epigenetic states, along with the extent to which changes in epigenetic states are stochastic. For example, it has been shown that monozygotic twins accumulate differences in DNA methylation throughout their lives, and that the accumulated differences affect their gene expression portrait [15]. Another interesting study has showed that prenatal tobacco smoke exposure affects DNA methylation in the fetus [7].
(p.365) Transgenerational inheritance of DNA methylation has been observed in several loci in mice [6, 41, 44]. In plants, changes in DNA methylation that are inherited across generations have been shown to be rather frequent and seem to be a common way of adapting gene regulation to a changing environment [20, 23, 47]. The implications of these phenomena for theories of inheritance and evolution are currently being explored [51].
The genesis of highthroughput sequencing technologies (also referred to as “nextgeneration” sequencing) has enabled the study of epigenetic features on a genomewide scale, giving rise to new fields of study. Genomewide data sets allow for the characterization of epigenetic phenomena across all known genes and allow for broad comparison studies between genes, tissues, individuals, and species [26, 27, 35, 36, 62]. The presence of genomescale epigenetic data enables association studies to look at both genomic and epigenomic features in the search for causal variants of disease, and the development of methods for epigenomewide association studies is underway [45].
In this chapter, we explore the use of Bayesian networks in the study of genomescale DNA methylation. We begin by describing different experimental methods for the genomescale annotation of DNA methylation in Section 14.2 and focus on the challenges present in the analysis of a specific method, called methylSeq. In Section 14.3, we introduce a Bayesian network for the analysis of methylSeq data, and in Section 14.4, we extend this model to incorporate more information from the genomic sequence. A recurring theme is the interplay between the model used to glean information from the technology, and the view of methylation that drives the model specification. We describe a study in which such models were used in Section 14.5, and show in Section 14.6 how the models led to both interesting biological conclusions and to insights about the nature of methylation.
14.2 Nextgeneration Sequencing and DNA Methylation
The transformative impact of highthroughput sequencing on many fields of biology cannot be overstated. The ability to sequence many short DNA fragments at a low cost per base pair (bp), motivated initially by the goal of lowering the costs of generating the DNA sequence of an individual or species, has found unexpected applications in molecular biology. One of the fields in which sequencing technologies have initiated a revolution over the past few years is epigenetics, and specifically DNA methylation. In this section, we describe the different ways in which highthroughput sequencing is used to study DNA methylation, and explain why there is a relatively large number of methods in use today. Throughout this section we focus on the possibilities and challenges of studying DNA methylation through the use of highthroughput sequencing. We will first outline the different techniques that enable measurement of DNA methylation on a genomewide scale, and then describe the main methods that make use of these techniques, explaining in detail the methylSeq protocol, which is the focus of this chapter.
There are several ways to detect DNA methylation, and a detailed description of the different techniques can be found in [31]. The major techniques currently available are illustrated in Fig. 14.1 (page 14.2.1) and include:
Enzyme digestion: This technique uses restriction enzymes that in addition to being sequencespecific are methylation sensitive (cut only if their recognition site is unmethylated). A notable example is HpaII, which digests at CCGG sites at which the second cytosine is unmethylated (p.366) (in this chapter we denote a C that is followed by a guanine (G) as CpG, and, to ease notation, denote a pair of Cs followed by a pair of Gs as CCGG, rather than CpCpGpG). The pattern of digestion by such enzymes can provide a readout of the DNA methylation. A disadvantage of this technique is the possibility of incomplete digestion and for some enzymes, the lack of sitespecific resolution.

Sodium bisulfite conversion: Treatment of DNA molecules with sodium bisulfite converts unmethylated Cs to uracils [24, 59]. Followed by a PCR step, the uracils are sequenced as thymines (Ts). This procedure enables the conversion of an epigenetic mark into a modification in the genomic sequence. Disadvantages of this technique include a reduction in sequence complexity (see details below), the need to chemically treat the DNA which can result in degradation, and the possibility of incomplete conversion.

Affinity enrichment: In this process, the genome is digested, and the digest is enriched for methylated regions. This is achieved through the recognition of methylated regions by either antibodies [40] or methylCpG binding domain proteins [63] and the separation of DNA regions that were bound from those that were not. Disadvantages of this technique include binding biases of the enzymes and the lack of sitespecific resolution.
14.2.1 Assaying Genomewide DNA Methylation
Methods for genomewide measurement of DNA methylation generally use one of the aforementioned techniques coupled with either highthroughput sequencing or array hybridization. Potentially, methods may incorporate more than one of the techniques, but current approaches do not do so. These methods can be broadly classified as methyltyping versus methylome (p.367) sequencing, in analogy with genotyping versus genome sequencing for DNA. In genotyping, only a small subset of an individual’s nucleotides are assayed (SNP locations), while in genome sequencing the whole genome of an individual is sequenced. The advantage of genotyping is in its significantly lower cost compared to whole genome sequencing, allowing for the inclusion of many more individuals in an experiment of a fixed cost. Similarly, methyltyping technologies allow surveying of genomescale methylation patterns by sampling a subset of CpG sites, and emphasize low cost at the expense of high resolution. In this section, we discuss different methods for methylome sequencing and methyltyping, along with the advantages and disadvantages associated with each method.
The different methods used for highthroughput genomics, including those for measuring DNA methylation, reduce to the ability of counting DNA fragments in a digest, obtained using either arrays or sequencing. Arrays are considered less accurate than sequencing for quantification purposes, mainly due to biases introduced in arrays by the variability in hybridization. Each of the techniques described can be followed by array hybridization, but affinity enrichment is by far the technique that is best suited for arrays [31]. The advantage of using arrays is the low cost, but disadvantages include the biases introduced in the hybridization step, as well as genomescale arraybased methods not being sitespecific (there are arraybased methods that use either enzyme digestion or bisulfite conversion and are sitespecific, but since they sample a considerably smaller number of CpGs, we do not consider them as genomescale methyltyping methods).
Advances in highthroughput sequencing in the past several years have brought about a large increase in the number of available methods for mapping DNA methylation on a genomewide scale. Wholegenome bisulfite sequencing (WGBS) involves random digestion of the genome followed by bisulfite treatment, amplification, and sequencing, offering the ability to measure absolute levels of DNA methylation at a singlenucleotide resolution. While this procedure has been used in different studies [11, 34, 35, 62], its use is limited as it is the most expensive method for DNA methylation annotation, because it requires the sequencing of whole genomes. Moreover, sequencing a whole methylome is considerably more expensive than sequencing a whole genome because of the continuous nature of the methylation phenomenon (we would like to determine the proportion of cells in the digest in which a cytosine was methylated), requiring significantly higher coverage than genome sequencing. Therefore, this method cannot, in the foreseeable future, be used for largescale comparison studies such as population studies and epigenetic association studies. Other disadvantages of the method that are not present in wholegenome sequencing include biases introduced in the polymerase chain reaction (PCR) amplification step and a reduction in sequence complexity. The PCR amplification step introduces biases that are due to unmethylated instances introducing ATrich sequences. The sequence complexity is reduced because every T sequenced in a read could be mapped to either a T or a C in the reference genome, reducing the number of locations producing uniquely mapping reads. A second method for annotating methylation across the whole genome is that of affinity enrichment followed by highthroughput sequencing. This method has been used in [37], but requires extensive sequencing, is not sitespecific, and is prone to binding biases.
The use of highthroughput sequencing has enabled several methods for methyltyping, two of which are discussed here: methylSeq and Reduced representation bisulfite sequencing (RRBS). In methylSeq the genome is digested with the methylation sensitive restriction enzyme HpaII that digests unmethylated CpGs that are within CCGG sites. This is followed by size selection of the fragments, a PCR amplification step, and sequencing the fragments’ ends. Mapping (p.368) sequenced reads back to a reference genome reveals unmethylated CpGs. methylSeq is a convenient methyltyping strategy because it is costeffective due to the sequencing of only unmethylated sites (the minority of CpG sites in vertebrates [55]), requires only small amounts of material, and avoids bisulfite conversion. However, although the experiment is relatively simple, interpretation of the sequencing data is not straightforward due to the protocol resulting in a nonrandom segmentation of the genome which is followed by a size selection step. We describe the methylSeq protocol and the bias that is associated with it in greater detail in the next section.
A second methyltyping method, RRBS, is based on digestion with a methylationinsensitive enzyme followed by bisulfite sequencing [38]. The procedure is enzyme digestion followed by size selection of the fragments, bisulfite treatment, PCR amplification, and sequencing of the fragments’ ends. The digestion is commonly performed with MspI [22, 39], which digests CCGG sites, to enrich for regions of the genome that are rich in CpG sites. RRBS is favorable due to being significantly less expensive to perform than wholegenome bisulfite sequencing, but suffers from the same disadvantages related to methods that use bisulfite conversion: bias introduced by the PCR amplification step and reduced sequence complexity.
When designing an experiment in which DNA methylation is measured on a genomewide scale, one must consider the tradeoff between the cost of the method and the type of coverage it produces. In addition to this, the application and analysis of the different methods are complicated by a number of other issues. The major complications of the different methods based on the technologies they use were discussed above. On top of this, the analysis requirements for different assays vary in difficulty, and the rapid development of the sequencing field calls for frequent reassessment of the comparison between methods. For these reasons, there has been a proliferation of methods whose pros and cons are constantly changing as sequencing technologies change. In [31], it was suggested that methylSeq is the method with the most favorable profile of pros and cons, with respect to the measures chosen for comparison (see Table 2 of [31]). We compare here the different methods presented, given the current stage of the field, but the reader should keep in mind that the comparison criteria will change as the field develops and as new techniques are introduced.
Table 14.1 summarizes the main features by which methods are compared. It is divided into methyltyping methods and wholegenome methods. We have omitted a column for a cost comparison due to the changing nature of the field (sequencing costs are rapidly decreasing), but it seems certain that in the foreseeable future, wholegenome methods will require significantly more sequencing than methylSeq and RRBS and that RRBS will require more sequencing than methylSeq. Methylseq retrieves information spanning more of the genome than RRBS, because of a more favorable profile of fragment sizes produced by HpaII relative to MspI [49].
In the next section, we explain in detail the methylSeq experimental procedure and the need for a statistical model dedicated to the analysis of the data produced, motivating the subsequent sections in the chapter.
14.2.2 The methylSeq Method
In this section, we describe the methylSeq experiment and the experimental bias it introduces. In the methylSeq experiment, the genomic DNA is digested with a methylationsensitive restriction enzyme; in our case, HpaII. HpaII digests at CCGG sites in which the second C is unmethylated. This step obtains a digest of DNA fragments such that at all of the fragment ends there is an unmethylated HpaII site, and all HpaII sites within a fragment (not including the ends) are (p.369) methylated (assuming complete digestion). Then, the fragments are sizeselected using a run on an agarose gel, where the common length accepted is between 50 and 300 bps. This size selection is important to achieve good sequencing throughput when using Illumina machines. A pairedend library is constructed from the fragments that pass the sizeselection step and, after a PCR amplification step, the ends of the fragments are sequenced. The paired sequenced reads are then mapped to a reference genome using an aligner that supports pairedend reads, such as Bowtie [32]. Notice that by constructing a pairedend library, the sequencing experiment returns pairs of sequences, where each pair is the product of sequencing the ends of one fragment. Using this protocol one can conclude which fragments where present in the digest.
Table 14.1 A summary of the characteristics of commonly used methods for methyltyping (top three) and whole methylome annotation (bottom two) in humans. For further information on the derivation of the values in the table see [49] and [31].
Preselected 
Coverage of 
Coverage of 
# CpGs 
Analysis 


Sitespecific 
Regions 
Human Genome 
CpG Islands 
Sampled 
Challenges 

methylSeq 
yes 
no 
9.2% 
92.9 % 
~1.4 M 
inference procedure needed 
RRBS 
yes 
no 
8.1% 
69.8 % 
~1.4 M 
low complexity + PCR bias 
Affinitybased Array 
no 
yes 
preselected 
preselected 
 
binding biases 
WGBS 
yes 
no 
whole genome 
whole genome 
~28 M 
low complexity + PCR bias 
Affinitybased Seq 
no 
no 
whole genome 
whole genome 
~28 M 
binding biases + array biases 
Note: RRBS, reduced representation bisulfite sequencing; WGBS, wholegenome bisulfite sequencing.
After the digestion step has completed, all unmethylated HpaII sites are present at the ends of digested fragments (see Fig. 14.1), but the size selection step required by the sequencing protocol limits sequencing to fragments of a narrow size range. This results in many cases in which one cannot determine the extent to which a site is methylated based on its read counts alone. Fig. 14.2 (page 370) shows three different types of epialleles and the fragments generated by them that pass the size selection step and are sequenced. When determining the methylation state of the highlighted site, cases 2 and 3 are indistinguishable if one considers only fragments originating at that site: in both cases, there are no fragments sequenced that have the highlighted site at either end. However, in case 2 we see that the sites on both sides of the highlighted site are unmethylated, and therefore if the highlighted site had been methylated, we would have sequenced a fragment 60 bps in length corresponding to the fragment in the middle (as sequenced in case 3). Since such a fragment was not sequenced, we can conclude that the site is unmethylated. In case 3, since the highlighted site was present in the interior of a fragment, we can conclude that it was methylated. While the examples in this figure assume binary methylation states, when relating to a population (p.370) of cells, methylation states are assumed to be continuous variables, because the methylation states can be heterogeneous even within a population of a single cell type [1, 64]. It can easily be seen that the examples above can be generalized to the case of continuous variables, where one cannot determine the extent to which a site is methylated given the read counts at that site alone, but can do so when making use of the fragments generated from the site’s neighborhood.
14.3 A Bayesian network for methylSeq Analysis
We have seen in the previous section that while methylSeq is a favorable method for methyltyping, the bias present in the method needs to be corrected for. When performing a methylSeq experiment, we would like to gain knowledge of the extent to which each HpaII site is methylated. More precisely, we would like to know for each HpaII site the proportion of cells in the digest that were methylated at that site. What we observe, however, is the number of times each fragment was sequenced in the experiment. Using our knowledge of the experimental procedure, we take a generative approach to model the procedure by which the methylation states of the HpaII sites impact the number of times each fragment is sequenced. On the basis of this generative model, we can infer the expected methylation states of the HpaII sites, given the observed fragment counts.
In the next section, we discuss how the methylSeq experiment can be modeled as a Bayesian network, and how that Bayesian network can be used to infer the methylation extents for all HpaII sites given the fragment counts. We begin by introducing a generative model for our data that is based on a Bayesian network. We then describe how the model parameters can be learned using the expectationmaximization (EM) algorithm, and how we use this together with the observed fragment counts to infer the methylation states of the HpaII sites. In the following section, we (p.371) discuss how expanding the model by adding an additional set of variables can achieve better sitespecific inferences, as well as infer the location of unmethylated clusters.
14.3.1 Notation
In this chapter, we will denote random variables by uppercase letters, and a specific assignment to a random variable by the corresponding lowercase letter. For example, three coin tosses can be modeled by the random variables X_{1},X_{2}, and X_{3} for the first, second, and third toss, respectively, and an assignment to X_{1} is denoted by x_{1}. We will denote sets of random variables by bold uppercase letters, and an assignment to the set of random variables by the corresponding bold lowercase letter. For example, we would denote by X the three random variables X_{1},X_{2}, and X_{3}, and by x an assignment to all three variables.
14.3.2 A Generative Model
In this section, we describe a generative model for the methylSeq experiment. Given a reference genome, let F be the set of genomic sequences that have a HpaII site at each end. Notice that these regions may also have HpaII sites within them. We note that although theoretically F is the set of fragments that may be present in the digest of a methylSeq experiment, some fragments have probability of being sequenced which is essentially zero, like fragments spanning whole chromosomes, but we disregard that now to ease notation. For every HpaII site, we assign a random variable $\{{Y}_{i}{\}}_{i=1,\dots ,N}\in [0,1]$, where N is the number of HpaII sites in the reference genome. Y_{i} represents the proportion of cells from the digest in which site i was methylated.
In our generative model, we assume that at each step a cell is drawn at random, and the methylation state of each HpaII site is determined as 1 (methylated) or 0 (unmethylated) using Bernoulli draws with probability y_{i} (in the case of diploid genomes, this sampling is done twice, once for each of the chromosomes). The methylation assignment to all HpaII sites of the chromosome determines what fragments will be generated from this cell. Let $\{{Z}_{i}{\}}_{i=1,\dots ,F}\in [0,1]$ be random variables for the proportion of instances from which fragment f_{i} was generated. In other words, let $\{{Z}_{i}{\}}_{i=1,\dots ,F}$ be random variables for the proportion of times that the methylation configuration of the HpaII sites of a chromosome generated fragment f_{i}. In this generative approach, ${z}_{i}\sim {N}\left({\mathrm{\mu}}_{i},\frac{{\mathrm{\mu}}_{i}(1{\mathrm{\mu}}_{i})}{wq}\right)$ where ${\mathrm{\mu}}_{i}=(1{y}_{s})(1{y}_{e})\mathrm{\Pi}{y}_{m}$ and y_{s} and y_{e} are the sites on the upstream and downstream boundaries of fragment f_{i}, y_{m} are the sites between y_{s} and y_{e}, q is the number of cells in the digest, and w is the number of chromosome copies in each cell (for human w=2).^{1}
Let $\{{X}_{i}{\}}_{i=1,\dots ,F}$ be random variables for the number of times fragment f_{i} was sequenced (the number of pairedend reads that were sequenced from fragment f_{i}). In our generative model, X_{i} follows a Poisson process in which the expected number of reads to be sequenced is ${\mathrm{\lambda}}_{i}={z}_{i}{\mathrm{\theta}}_{{l}_{i}}$, where ${\mathrm{\theta}}_{{l}_{i}}$ is a factor determined by the length of the fragment, and depends on the specific experiment technicalities, such as the total amount of sequencing in the experiment. We condition X_{i} on the length of f_{i} since, as described in the previous section, the size selection step in the experiment is not precise, and moreover, fragments of different lengths have different probabilities of being (p.372) sequenced, due to technicalities of the sequencing machinery. Fig. 14.3 shows the dependences of our generative model in a graphical model representation.
We have made a few simplifying assumptions in this generative model because they greatly simplify the dependences among variables and result in a tractable model that is convenient to work with for inference purposes. The first simplifying assumption we make is that the methylation states of neighboring HpaII sites are independent. Second, in contrast to many generative models of RNA sequencing (RNAseq) [42], in this model we assume X_{i}, the number of times fragment f_{i} is sequenced, depends on the proportion of cells from which f_{i} was generated (Z_{i}), but not on the relative proportion to which f_{i} is present in the digest $\left(\frac{{z}_{i}}{\sum _{j=1}^{F}{z}_{j}}\right)$. We allow ourselves to make this assumption in the methylSeq case because, in RNAseq, the contribution of each fragment to the digest is bounded by the number of cells (each cell can contribute at most $w\cdot {f}_{i}$ fragments), making the differences between fragment frequencies much smaller than in RNAseq experiments. This, together with the fact that in methylSeq the number of possible fragments (contributors) is much larger than the number of transcripts in RNAseq experiments, brought us to relax the dependence on the relative proportion of f_{i} in the solution.
14.3.3 Parameter Learning and Inference of Posterior Probabilities
Now that we have a model in place, recall that our objective is to infer the posterior probability for each Y_{i}, given the observed X and L variables. We have described a generative model by which we assume the X_{i}’s are generated (and are dependent on the Y_{i}’s). In this section, we will describe how we can use learning and inference algorithms to attain estimates of the Y variables given the experimental results. For this purpose we will make a few simplifying changes to our model: we (p.373) discretize the Y variables such that $\{{Y}_{i}{\}}_{i=1,\dots ,N}\in \{0.1,0.3,0.5,0.7,0.9\}$, and we assume that the Z variables are determined deterministically by the Y variables, such that ${z}_{i}={\mathrm{\mu}}_{i}$.
To summarize, the list of variables, parameters and dependences in our model is as follows:
Variables:

${Y}_{1},\dots ,{Y}_{N}\in \{0.1,0.3,0.5,0.7,0.9\}$: for each HpaII site, the proportion of cells in the digest that are methylated.

${Z}_{1},\dots ,{Z}_{F}\in [0,1]$: the proportion of cells from which fragment f_{i} was generated.

${L}_{1},\dots ,{L}_{F}\in \{1,\dots ,{l}_{max}\}$: the length in bps of fragment f_{i}.

${X}_{1},\dots ,{X}_{F}\in \{1,\dots ,{x}_{max}\}$: the number of times fragment f_{i} was sequenced.
Parameters:

${\mathrm{\theta}}_{L}=({\mathrm{\theta}}_{120},{\mathrm{\theta}}_{2140},\dots ,{\mathrm{\theta}}_{381400},{\mathrm{\theta}}_{\ge 400})$.
Dependence functions:

${z}_{i}=(1{y}_{{s}_{i}})(1{y}_{{e}_{i}}){\mathrm{\Pi}}_{j={s}_{i}+1}^{{e}_{i}1}{y}_{j}$, where s_{i} and e_{i} are the indexes of the locations at which f_{i} begins and ends.

$\mathbb{P}({x}_{i}{z}_{i},{l}_{i})=\frac{{\mathrm{\lambda}}^{{x}_{i}}exp\{\mathrm{\lambda}\}}{{x}_{i}!}$, where $\mathrm{\lambda}={z}_{i}f({l}_{i})$, and $f:\{1,\dots ,{l}_{max}\}\to {\mathrm{\theta}}_{L}$ is a function that takes in l_{i} and returns θ_{s−e} such that $s\le {l}_{i}\le e$.

In practice, fragments that are very short or very long relative to the boundaries of the size selection step have probability zero of being observed, and we can omit the corresponding variables from the model. This significantly reduces the number of variables in the model, and the complexity of parameter learning and inference procedures.
Although we listed $\{{Z}_{i}{\}}_{i=1,\dots ,F}\in [0,1]$ for convenience, we note that when the Y variables are discrete, so are the Z variables. The number of variables a specific Z_{i} is directly dependent on is determined by the number of HpaII sites that are on the fragment Z_{i} represents. Due to the relatively small sizeselection bounds (50–300 bps) and the HpaII restriction site being of length four (CCGG), the majority of Z variables are directly dependent on a small number of Y variables. This is important from practical reasons, because when conducting the inference step in the EM approach (see below), the Bayesian network is moralized,^{2} and the running time of the inference procedure is dependent on the sizes of the cliques introduced (from a computational complexity perspective, the running time is exponential in the size of the largest clique). To avoid our moralized graph from encompassing large cliques we determine a priori a maximal number of parents we allow for any Z_{i} variable. Variables with more parents than allowed are not incorporated into the model. While this results in ignoring data from fragments that encompass too many HpaII sites, it has a crucial impact on lowering the computational complexity of the inference procedure.
The direct output of the experiment is a list of paired sequenced reads. Each pair of reads is mapped to a reference genome to determine which genomic fragment it originated from (see SubSection 14.2.2). In practice, some of the pairs cannot be mapped uniquely to one fragment of the genome. To resolve this, one may add another layer to the model, taking into account the uncertainty regarding which fragment the reads originated from. However a simpler approach is to disregard from the model fragments that can generate pairs of reads that do not uniquely map.
(p.374) In the same manner that we have a variable determining the length of each fragment, we can consider additional variables for characteristics that affect the extent of sequencing (such as GC content [2, 12]). However one must take into account that in the current setting, this will result in an exponential growth in the number of parameters.
The EM Approach
We would like to compute for each Y_{i} its posterior distribution given the observed data. Namely, we are interested in $\mathbb{P}({Y}_{i}\mathit{\text{X}},\mathit{\text{L}})$ for all Y_{i} variables. Given some parameter assignment to θ_{L}, we can compute the posterior probabilities $\mathbb{P}({Y}_{i}\mathit{\text{X}},\mathit{\text{L}},{\mathrm{\theta}}_{L})$ in time that is exponential in the size of the largest clique of the moralized graph, using the junction tree algorithm [29]. This computation however requires a parameter assignment, which we do not know.
On the other hand, if we were to observe a full assignment of all variables in the Bayesian network (Y,Z,L, and X), let such a data set be ${D}$, we could learn a set of parameters for our model using the maximum likelihood approach by finding the set of parameters that maximizes the likelihood function relative to ${D}$. We give an outline of these computations here, and leave the details for Appendix 14.8 (page 14.8). From the Bayesian network structure we can construct the likelihood function:
assuming a uniform prior over Y and L and that the probability of ${D}$ is not zero (that all z_{i} assignments are as expected given the y_{i} assignments). The function f(l_{i}) is as defined in the “dependence function” annotation. The loglikelihood is therefore:
where θ_{L(j)} is the jth element of θ_{L} and C is the part of the equation with elements that do not include instances from θ_{L}. We would like to find θ_{L} which maximizes the likelihood function. Each θ_{L(j)} can be optimized separately, and by differentiating the loglikelihood, setting to zero and solving for θ_{L(j)}, we get the maximum likelihood parameters:
To conclude, in the presence of a set of parameters we can compute the posterior distribution for the Ys, and in the presence of a fully observed data set, we can compute the maximum likelihood estimator for the parameters. This is a good setting to use the EM algorithm, which returns a parameter assignment for our model, given the observed data. Given an initial assignment of (arbitrary) parameters, the algorithm computes the expected values of the numerator and denominator used for the maximum likelihood parameter estimation. It then updates the parameters of the model using the expected values computed. This algorithm is guaranteed to converge to a (p.375) stationary point of the loglikelihood function. We start with some (possibly random) parameter assignment, ${\mathrm{\theta}}_{L}^{1}$, and iterate between the following two steps:
Expectation step (Estep): In this step the algorithm uses the parameters from the previous maximization step (Mstep), ${\mathrm{\theta}}_{L}^{t}$, to compute the expected values for the sufficient statistics used in the maximum likelihood parameter estimation. The expected values are computed for sufficient statistics that incorporate “hidden” (unobserved) variables of the model. The expected sufficient statistic that needs to be computed for θ_{L(j)} is
$$\begin{array}{rl}{\mathbb{E}}_{ZX,L,{\mathrm{\theta}}_{\mathit{\text{L}}}^{t}}\left[\sum _{i}{Z}_{i}\text{}\mathbf{\text{1}}\{f({l}_{i})={\mathrm{\theta}}_{L(j)}\}\right]& =\sum _{i}{\mathbb{E}}_{ZX,L,{\mathrm{\theta}}_{L}^{t}}\left[{Z}_{i}\text{}\mathbf{\text{1}}\{f({l}_{i})={\mathrm{\theta}}_{L(j)}\}\right]\\ & =\sum _{i}\sum _{Z}{Z}_{i}\mathbb{P}({z}_{i}X,\mathit{\text{L}},{\mathrm{\theta}}_{\mathit{\text{L}}})\text{}\mathbf{\text{1}}\{f({l}_{i})={\mathrm{\theta}}_{L(j)}\},\end{array}$$and can be computed using the posterior distributions obtained by the junctiontree algorithm on the moralized graph.
Maximization step (Mstep): In this step, we use the expected sufficient statistics from the Estep to perform the maximum likelihood estimation:
$${\mathrm{\theta}}_{L(j)}^{t+1}=\frac{\sum {x}_{i}\text{}\mathbf{\text{1}}\{f({l}_{i})={\mathrm{\theta}}_{L(j)}\}}{{\mathbb{E}}_{ZX,{\mathrm{\theta}}_{\mathit{\text{L}}}^{t}}\left[\sum {z}_{i}\text{}\mathbf{\text{1}}\{f({l}_{i})={\mathrm{\theta}}_{L(j)}\}\right]\}}.$$
The algorithm iterates between the Estep and the Mstep until the increase in the likelihood function is smaller than a given threshold. After the algorithm has converged at ${\mathrm{\theta}}_{\mathit{\text{L}}}^{f}$, we can compute for each Y_{i} its posterior distribution using a final run of the junction tree algorithm.
14.4 Genomic Structure as a Prior on Methylation Status
In the generative model described in the previous section, we did not assume any prior knowledge about the methylation states of different HpaII sites. However, it is well known that one can make use of the genomic sequence to gain some information regarding the probability that a specific HpaII site is methylated. This is because in vertebrates, unmethylated sites tend to cluster together and CpG sites that are constitutively unmethylated are more conserved than other CpG sites. This results in the effect that regions that tend to be unmethylated are more CpG rich than regions that tend to be methylated [4]. There are several methods to annotate such CpGrich regions, based on genomic sequences, in an effort to find regions that have interesting DNA methylation behavior. Such regions are called “CpG islands”.
The precise definition of CpG island regions and their methylation states is a challenging task that has been the subject of much research throughout the years [4, 16, 17, 25, 50, 54, 53, 56, 61]. This is not surprising if we recall that DNA methylation states are more dynamic than the DNA sequence. For example, DNA methylation states may be different across the different tissues of (p.376) an individual [26, 53]. Thus, a purely sequence based definition of a “CpG island” is problematic. Nonetheless, in the case of experimental annotation of unmethylated sites, clusters of CpGs provide evidence against methylation that should ideally be incorporated into our model.
We therefore add a hidden random variable for each HpaII site, $\{{W}_{i}{\}}_{i=1,\dots ,N}\in \{1,0\}$, indicating whether the ith HpaII site is in an unmethylated cluster or not. We note that in our setting we assume that there are unmethylated clusters in the experiment, and the W variables denote the presence of a HpaII site in such a cluster. Different methylSeq experiments, on different tissues for example, may have a different set of unmethylated clusters. In addition, we add for each HpaII site two observed variables, denoted by C and D. The notation $\{{C}_{i}{\}}_{i=1,\dots ,N}\in \{1,\dots ,{c}_{max}\}$ indicates the number of CpG sites between the (i−1)^{th} HpaII site and the ith HpaII site (any CpGs that are not part of a CCGG site). The notation $\{{D}_{i}{\}}_{i=1,\dots ,N}\in \{1,\dots ,{d}_{max}\}$ indicates the distance in bps between the (i−1)^{th} HpaII site and the ith HpaII site. A graphical model representation of the model augmented with these additions can be seen in Fig. 14.4. The list of variables, parameters, and dependence functions added to our model is the following:
Variables:

${W}_{1},\dots ,{W}_{N}\in \{1,0\}$: indicates for HpaII site i whether it is present in an unmethylated island (1) or not (0).

(p.377) ${C}_{1},\dots ,{C}_{N}\in \{1,\dots {c}_{max}\}$: the number of CpG sites between the previous and current HpaII sites.

${D}_{1},\dots ,{D}_{N}\in \{1,\dots ,{d}_{max}\}$: the distance between the previous and current HpaII sites.
Parameters:

${\mathrm{\theta}}_{YW}=\{{\mathrm{\theta}}_{{y}^{0}{w}^{0}},\dots ,{\mathrm{\theta}}_{{y}^{4}{w}^{0}},{\mathrm{\theta}}_{{y}^{0}{w}^{1}},\dots ,{\mathrm{\theta}}_{{y}^{4}{w}^{1}}\}$, \\ where $0\le {\mathrm{\theta}}_{{y}^{i}{w}^{j}}\le 1$, $\sum _{i}{\mathrm{\theta}}_{{y}^{i}{w}^{0}}=1$ and $\sum _{i}{\mathrm{\theta}}_{{y}^{i}{w}^{1}}=1$.

${\mathrm{\theta}}_{W{W}_{1},C,D}=\{{\mathrm{\theta}}_{{w}^{a}{w}_{1}^{b},{c}_{({l}_{C},{h}_{C})},{d}_{({l}_{D},{h}_{D})}}\}$ where \ $a,b\in \{0,1\},$ \ $({l}_{C},{h}_{C})\in \{(0,2),\dots ,({2}^{k1}+1,{2}^{k}),\dots ,(513,1024),(\ge 1025)\}$, \ $({l}_{D},{h}_{D})\in \{(0,5),\dots ,(5\cdot {2}^{k1}+1,5\cdot {2}^{k}),\dots ,(1280,2560),(\ge 2561)\}$, \\ and w_{−1} denotes the states of the W variable that is the parent of the W variable at hand. \ For all indexes, $0\le {\mathrm{\theta}}_{{w}^{a}{w}_{1}^{b},{c}_{({l}_{C},{h}_{C})},{d}_{({l}_{D},{h}_{D})},}\le 1$, and \ $\sum _{t=\{0,1\}}{\mathrm{\theta}}_{{w}^{t}{w}_{1}^{b},{c}_{({l}_{C},{h}_{C})},{d}_{({l}_{D},{h}_{D})}}=1$.
Dependence functions:

All of the new dependence functions are fully described by the parameters added. The way in which every variable is dependent on its parents is fully captured by the parameters specified.

Applying the EM Algorithm
In order to make inferences from this model, for the same reasons discussed in the previous section, we make use of the EM algorithm. Assuming we have a fully observed data set, we can use the likelihood function as in the previous section to derive the maximum likelihood estimators for our model parameters:
and
where ${f}_{1}:\{1,\dots {c}_{max}\}\to ({l}_{C},{h}_{C})$ takes in c_{i} and returns a pair, $({l}_{C},{h}_{C})$, such that ${l}_{C}\le {c}_{i}\le {h}_{C}$, and ${f}_{2}:\{1,\dots ,{d}_{max}\}\to ({l}_{D},{h}_{D})$ takes in d_{i} and returns a pair, (l_{D},h_{D}), such that ${l}_{D}\le {d}_{i}\le {h}_{D}$.
As before, we start with some (possibly random) assignment of the parameters, and iterate between the following two steps until we determine convergence. For convenience, we denote by θ the complete parameter set for our model.
Expectation step (Estep): In this step, the algorithm uses the parameters from the last Mstep, θ^{t}, to compute the expected values for the sufficient statistics used in the maximum likelihood parameter estimation (that incorporate “hidden” variables). For ${\mathrm{\theta}}_{L(i)}$, this is done as previously described. Let O = (X, L, C, D) be the set of observed variables, and H = (W, Y, Z) be the set of hidden (p.378) variables. Using the junctiontree algorithm on the moralized graph, we can compute the probability of any assignment to some variable, Q, and its parents, U, given the observed data and θ^{t}. In other words, after a run of the junctiontree algorithm, we know $\mathbb{P}(\mathit{\text{q}},\mathit{\text{u}}\mathit{\text{O}},{\mathrm{\theta}}^{t})$ for every assignment of Q and U, (q u). We can then compute the needed expected sufficient statistics. For ${\mathrm{\theta}}_{{y}^{a}{w}^{b}}$ we compute
$$\begin{array}{rcl}{\mathbb{E}}_{HO,{\mathrm{\theta}}^{t}}[\sum _{i}\text{}\mathbf{\text{1}}\{{y}_{i}={y}^{a},{w}_{i}={w}^{b}\}]& =& \sum _{i}{\mathbb{E}}_{HO,{\mathrm{\theta}}^{t}}[\text{}\mathbf{\text{1}}\{{y}_{i}={y}^{a},{w}_{i}={w}^{b}\}]\\ & =& \sum _{i}\mathbb{P}({y}_{i}={y}^{a},{w}_{i}={w}^{b}O,{\mathrm{\theta}}^{t})\end{array}$$and
$${\mathbb{E}}_{HO,{\mathrm{\theta}}^{t}}[\sum _{i}\text{}\mathbf{\text{1}}\{{w}_{i}={w}^{b}\}]=\sum _{i}\mathbb{P}({w}_{i}={w}^{b}O,{\mathrm{\theta}}^{t}).$$$$\begin{array}{rl}& {\mathbb{E}}_{HO,{\mathrm{\theta}}^{t}}[\sum _{i}\text{}\mathbf{\text{1}}\{{w}_{i}={w}^{a},{w}_{i1}={w}_{1}^{b},{f}_{1}({c}_{i})={c}_{({l}_{C},{h}_{C})},{f}_{2}({d}_{i})={d}_{({l}_{D},{h}_{D})}\}]\\ & \phantom{\rule{1em}{0ex}}=\sum _{i}\mathbb{P}({w}_{i}={w}^{a},{w}_{i1}={w}_{1}^{b},{f}_{1}({c}_{i})={c}_{({l}_{C},{h}_{C})},{f}_{2}({d}_{i})={d}_{({l}_{D},{h}_{D})}O,{\mathrm{\theta}}^{t})\end{array}$$and
$$\begin{array}{rl}& {\mathbb{E}}_{HO,{\mathrm{\theta}}^{t}}[\sum _{i}\text{}\mathbf{\text{1}}\{{w}_{i1}={w}_{1}^{b},{f}_{1}({c}_{i})={c}_{({l}_{C},{h}_{C})},{f}_{2}({d}_{i})={d}_{({l}_{D},{h}_{D})}\}]\\ & \phantom{\rule{1em}{0ex}}=\sum _{i}\mathbb{P}({w}_{i1}={w}_{1}^{b},{f}_{1}({c}_{i})={c}_{({l}_{C},{h}_{C})},{f}_{2}({d}_{i})={d}_{({l}_{D},{h}_{D})}O,{\mathrm{\theta}}^{t}).\end{array}$$
We recall that the running time complexity of the junctiontree algorithm is exponential in the size of the maximum sized clique of the moralized graph, and it is easy to see how the moralized network can be chorded such that, aside from the cliques generated in the previous model, only cliques of size three are introduced.
Maximization step (Mstep): In this step we generate an updated set of parameters, ${\mathrm{\theta}}^{t+1}$, by using the expected sufficient statistics computed in the Estep. The computation of ${\mathrm{\theta}}_{L}^{t+1}$ remains the same as before, and ${\mathrm{\theta}}_{{y}^{a}{w}^{b}}^{t+1}$ and ${\mathrm{\theta}}_{{w}^{a}{w}_{1}^{b},{d}_{{l}_{D},{h}_{D}},{c}_{{l}_{C},{h}_{C}}}^{t+1}$ are updated as follows:
$${\mathrm{\theta}}_{{y}^{a}{w}^{b}}^{t+1}=\frac{\sum _{i}\mathbb{P}({y}_{i}={y}^{a},{w}_{i}={w}^{b}O,{\mathrm{\theta}}^{t})}{\sum _{i}\mathbb{P}({w}_{i}={w}^{b}O,{\mathrm{\theta}}^{t})},$$$$\begin{array}{rl}& {\mathrm{\theta}}_{{w}^{a}{w}_{1}^{b},{c}_{({l}_{C},{h}_{C})},{d}_{({l}_{D},{h}_{D})}}^{t+1}\\ & =\frac{\sum _{i}\mathbb{P}({w}_{i}={w}^{a},{w}_{i1}={w}_{1}^{b},{f}_{1}({c}_{i})={c}_{({l}_{C},{h}_{C})},{f}_{2}({d}_{i})={d}_{({l}_{D},{h}_{D})}O,{\mathrm{\theta}}^{t})}{\sum _{i}\mathbb{P}({w}_{i1}={w}_{1}^{b},{f}_{1}({c}_{i})={c}_{({l}_{C},{h}_{C})},{f}_{2}({d}_{i})={d}_{({l}_{D},{h}_{D})}O,{\mathrm{\theta}}^{t})}.\end{array}$$
The algorithm iterates between the Estep and the Mstep until the increase in the likelihood function is smaller than a given threshold. After the algorithm has converged at θ^{f}, we can compute the posterior distribution for each Y_{i} by using a final run of the junction tree algorithm. We can now also compute for every W_{i} the probability that the HpaII site at index i is in an unmethylated cluster. We elaborate on this in the next section.
14.5 Application: Methyltyping the Human Neutrophil
In the previous sections, we have described a Bayesian network for the correction of bias introduced in highthroughput sequencing experiments in which the genome is digested in a nonrandom manner followed by a sizeselection step, and have focused on the methylSeq experiment. In this chapter, we present in detail results from a study that has applied such a model for the purpose of characterizing the methyltypes of neutrophil cells from four human individuals [49].
In [49], the authors show that the methyltyping approach applied by methylSeq coupled with a Bayesian network correction procedure, called MetMap, is sufficient to survey methylation states across the genome and provides significant insight into the methylome, inside and outside of CpG islands, at sitespecific resolution. In the study at hand, the methyltypes of four male human individuals were determined, using the standard methylSeq protocol. Sitespecific methylation values could be assigned to 4.8% of the CpGs in the human reference genome, and of these, 20% were inside CpG islands.
The authors show that the precision achieved by using MetMap’s correction was greatly increased over that of using the raw read counts. This was shown by determining the true methylation state of 46 HpaII sites using direct bisulfite sequencing and observing that the Pearson correlation of the MetMap inferred scores with the bisulfite validations, which was 0.90, was significantly higher than the Pearson correlation of the methylation values estimated by the raw fragment counts with the bisulfite validations, which was 0.67.
14.5.1 Unmethylated Clusters
As we have discussed throughout this chapter, unmethylated sites in vertebrate genomes tend to be clustered together. While this enables predicting locations of such clusters from the genomic sequence, if experimental methylation data are available, it is desirable to annotate unmethylated islands directly. This is because DNA methylation varies across tissues and conditions, and experimental data will reveal a more precise picture of the methylation state for the specific experiment.
In [49], the authors identified experimentspecific unmethylated islands, and called them SUMIs (strongly unmethylated islands). These unmethylated clusters are specific to the (p.380) experiment at hand, and therefore to the human neutrophil. An interesting test was to investigate how many of the SUMIs found in the neutrophil experiment were also present in the regions inferred as unmethylated by sequencebased methods (see Fig. 14.5). The authors compared the 20 986 SUMIs they found with regions inferred to be unmethylated by two sequencebased methods: CpG islands found using the University of California, Santa Cruz (UCSC) browser (termed UCSC islands) and “bona fide” CpG islands (termed BF islands) [5]. The UCSC islands are the sequencebased set of islands that are most commonly used, while the BF islands were annotated using a more sophisticated, and potentially more exact, method. Of the 20 986 SUMIs present in at least one of the four individuals, 4 652 do not overlap the UCSC islands, and 7 055 do not overlap the BF islands.
The authors could validate the sensitivity and specificity of the SUMI set, compared to the sequencebased methods, by using direct bisulfite sequencing. Using this method, it was possible to annotate the methylation states of all the CpG sites at stretches of several hundred bps. As a validation of specificity, five regions that were annotated as part of both a UCSC island and a BF island, and did not overlap with SUMIs, were bisulfite sequenced. All regions were found to be methylated in the neutrophil sample. As a validation of sensitivity, four regions that overlapped SUMIs and UCSC islands but not BF islands were bisulfite sequenced. All regions were found to be unmethylated in the neutrophil sample. These results demonstrated that the true methylation state in a specific tissue may be different than the methylation state inferred by the genome sequence alone, and perhaps of the methylation state present in other cell types.
Interestingly, 3 797 SUMIs did not overlap with BF islands or UCSC islands, revealing new regions that are unmethylated in neutrophil cells. These regions showed strong association with regions associated with other functional features, such as open chromatin, high sequence conservation, gene regions, and the 5ʹ end of genes. Table 14.2 displays the association of the different sets of islands with these features. The authors point out that the proportion of SUMIs that map at a distance from transcription start sites is larger for novel SUMIs than for all SUMIs, but that novel SUMIs have a degree of association with open chromatin similar to that observed for all SUMIs; this suggests that novel SUMIs may often represent distal regulatory sequences.
Table 14.2 Percentages of neutrophil strongly unmethylated islands, CpG islands annotated by the University of California, Santa Cruz browser, and bona fide CpG islands that overlap regions associated with functionality. For details on how the functional regions were determined, see [49].
SUMIs (%) 
UCSC CpG Islands (%) 
BF Islands (%) 
Novel SUMIs (%) 


Open chromatin 
70.0% 
52.9% 
65.3% 
61.0% 
Conservation 
71.1% 
68.5% 
76.2% 
49.6% 
Gene regions 
76.9% 
77.7% 
79.7% 
59.9% 
TSS regions 
59.8% 
52.2% 
61.4% 
22.0% 
Note: BF, bona fide; SUMI, strongly unmethylated islands; TSS, transcription start site; UCSC, University of California, Santa Cruz.
(p.381) 14.6 Conclusions
In this chapter, we have demonstrated how Bayesian networks can be utilized in the genomescale study of DNA methylation. By designing a generative model that incorporates hidden variables for the biological states and observed variables for the generated experimental data we were able to predict the biological states, even when such inference was not straightforward.
The need for a Bayesian network comes from the experiment of interest (methylSeq) encompassing a nonrandom digestion of the genome followed by a sizeselection step. Bayesian networks are suitable for such a case because they can be used to model the dependences between neighboring sites. The model presented can be adjusted to account for similar biases in several other methyltyping methods including the use of several different methylationsensitive restriction enzymes. It is also expected to prove useful in other highthroughput sequencing experiments that are prone to such biases, for example some of the recent protocols developed for determining RNA secondary structure [58].
Acknowledgments
We thank Dario Boffelli and David I.K. Martin for many extensive discussions on DNA methylation. M.S. and L.P. were funded in part by NIH grant R01 HG00612901.
Appendix 14.A Computations Used For The Em Approach
In this appendix, we give a detailed description of the computations involved in annotating the maximum likelihood parameters, given an assignment to all states of the Bayesian network shown in Fig. 14.3. Given some assignment $\{\mathit{\text{y}},\mathit{\text{z}},\mathit{\text{l}},\mathit{\text{x}}\}$ to the variables Y Z L and X the Bayesian network implies the likelihood function:
where Q is the constant for $\left[{\mathrm{\Pi}}_{j=1}^{N}\mathbb{P}({y}_{j})\right]$, W is the constant for $\left[{\mathrm{\Pi}}_{i=1}^{F}\mathbb{P}({l}_{i})\right]$ and we assume that the probability of ${D}$ is not zero ($\mathbb{P}({z}_{i}y)=1$ for all Z_{i} of ${D}$). The function f(l_{i}) is as defined in the “dependence function” annotation. Notice that in this setting, we are assuming a full assignment to all random variables of the model, and therefore terms that do not consider parameters can be referred to as constants. Our goal is to find the assignment of parameters that maximize the above likelihood function. We can do this by finding the parameter assignment that maximizes the loglikelihood:
where θ_{L(k)} is the kth element of θ_{L} and C is the part of the equation with elements that do not include instances from θ_{L}. We would like to find θ_{L} that maximizes the likelihood function, which we denote by ${\stackrel{\u02c6}{\mathrm{\theta}}}_{L}$. We denote the jth element of ${\stackrel{\u02c6}{\mathrm{\theta}}}_{L}$ by ${\stackrel{\u02c6}{\mathrm{\theta}}}_{L(j)}$. For each j we can find ${\stackrel{\u02c6}{\mathrm{\theta}}}_{L(j)}$ by differentiating the loglikelihood, setting to zero and solving for θ_{L(j)}:
Solving this for θ_{L(j)}, we get:
Bibliography references:
[1] M.P. Ball, J.B. Li, Y. Gao, J. Lee, E.M. Leproust, I. Park, B. Xie, G. Q. Daley, and G.M. Church. Targeted and genomescale strategies reveal genebody methylation signatures in human cells. Nature Biotechnology, 27(4):361–368, 2009.
[2] Y. Benjamini and T.P. Speed. Summarizing and correcting the GC content bias in highthroughput sequencing. Nucleic Acids Research, 40(10): e72, 2012.
[3] B.P. Berman, D.J. Weisenberger, J.F. Aman, T. Hinoue, Z. Ramjan, Y. Liu, H. Noushmehr, C.P.E. Lange, C.M. van Dijk, R.A.E.M. Tollenaar, D. Van Den Berg, and P.W. Laird. Regions of focal DNA hypermethylation and longrange hypomethylation in colorectal cancer coincide with nuclear lamina associated domains. Nature Genetics, 44(1):40–46, 2011.
[4] A.P. Bird. CpGrich islands and the function of DNA methylation. Nature, 321(6067):209–213, 1986.
[5] C. Bock, J. Walter, M. Paulsen, and T. Lengauer. CpG island mapping by epigenome prediction. PLOS Computational Biology, 3(6):e110, 2007.
[6] J. Borgel, S. Guibert, Y. Li, H. Chiba, D. Schübeler, H. Sasaki, T. Forné, and M. Weber. Targets and dynamics of promoter DNA methylation during early mouse development. Nature Genetics, 42(12):1093–1100, 2010.
[7] C.V. Breton, H.M. Byun, M. Wenten, F. Pan, A. Yang, and F.D. Gilliland. Prenatal tobacco smoke exposure affects global and genespecific DNA methylation. American Journal of Respiratory and Critical Care Medicine, 180(5):462–467, 2009.
[8] E.I. Campos and D. Reinberg. Histones: annotating chromatin. Annual Review of Genetics, 43(1):559–599, 2009.
[9] H. Cedar and Y. Bergman. Epigenetics of haematopoietic cell development. Nature Reviews Immunology, 11(7):478–488, 2011.
[10] N.M. Cohen, V. Dighe, G. Landan, S. Reynisdóttir, A. Palsson, S. Mitalipov, and A. Tanay. DNA methylation programming and reprogramming in primate embryonic stem cells. Genome Research, 19(12):2193–2201, 2009.
[11] S.J. Cokus, S. Feng, X. Zhang, Z. Chen, B. Merriman, C.D. Haudenschild, S. Pradhan, S.F. Nelson, M. Pellegrini, and S.E. Jacobsen. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452:215–219, 2008.
[12] J.C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. Substantial biases in ultrashort read data sets from highthroughput DNA sequencing. Nucleic Acids Research, 36(16):e105, 2008.
[13] M. Ehrlich. DNA methylation in cancer: too much, but also too little. Oncogene, 21(35):5400–5413, 2002.
[14] J. Ernst, P. Kheradpour, T.S. Mikkelsen, N. Shoresh, L.D. Ward, C.B. Epstein, X. Zhang, L. Wang, R. Issner, M. Coyne, M. Ku, T. Durham, M. Kellis, and B.E. Bernstein. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345):43–49, 2011.
[15] M.F. Fraga, E. Ballestar, M.F. Paz, S. Ropero, F. Setien, M.L. Ballestar, D. HeineSuner, J.C. Cigudosa, M. Urioste, J. Benitez, M. BoixChornet, A. SanchezAguilera, C. Ling, E. Carlsson, P. Poulsen, A. Vaag, Z. Stephan, T.D. Spector, Y.Z. Wu, C. Plass, and M. Esteller. Epigenetic differences arise during the lifetime of monozygotic twins. Proceedings of the National Academy of Sciences of the United States of America, 102(30):10604–10609, 2005.
[16] M. GardinerGarden and M. Frommer. CpG islands in vertebrate genomes. Journal of Molecular Biology, 196(2):261–282, 1987.
[17] J.L. Glass, R.F. Thompson, B. Khulan, M.E. Figueroa, E.N. Olivier, E.J. Oakley, G. Van Zant, E.E. Bouhassira, A. Melnick, A. Golden, M.J. Fazzari, and J.M. Greally. CG dinucleotide clustering is a speciesspecific property of the genome. Nucleic Acids Research, 35(20):6798–6807, 2007.
[18] (p.384) M.G. Goll and T.H. Bestor. Eukaryotic cytosine methyltransferases. Annual Review of Biochemistry, 74(1):481–514, 2005.
[19] A. Goren and H. Cedar. Replicating by the clock. Nature Reviews Molecular Cell Biology, 4(1):25–32, 2003.
[20] R.T. GrantDownton and H.G. Dickinson. Epigenetics and its implications for plant biology 2. The ‘epigenetic epiphany’: epigenetics, evolution and beyond. Annals of Botany, 97(1):11–27, 2006.
[21] S.I.S. Grewal and D. Moazed. Heterochromatin and epigenetic control of gene expression. Science, 301(5634):798–802, 2003.
[22] H. Gu, C. Bock, T.S. Mikkelsen, N. Jager, Z.D. Smith, E. Tomazou, A. Gnirke, E.S. Lander, and A. Meissner. Genomescale DNA methylation mapping of clinical samples at singlenucleotide resolution. Nature Methods, 7(2):133–136, 2010.
[23] I.R. Henderson and S.E. Jacobsen. Epigenetic inheritance in plants. Nature, 447(7143):418–424, 2007.
[24] H. Hikoya. Discovery of bisulfitemediated cytosine conversion to uracil, the key reaction for DNA methylation analysis—a personal account. Proceedings of the Japan Academy, Series B, 84(8):321–330, 2008.
[25] F. Hsieh, S.C. Chen, and K. Pollard. A nearly exhaustive search for CpG islands on whole chromosomes. The International Journal of Biostatistics, 5(1):14, 2009.
[26] R. Illingworth, A. Kerr, D. Desousa, H. Jørgensen, P. Ellis, J. Stalker, D. Jackson, C. Clee, R. Plumb, J. Rogers, S. Humphray, T. Cox, C. Langford, and A. Bird. A novel CpG island set identifies tissuespecific methylation at developmental gene loci. PLOS Biology, 6(1):e22, 2008.
[27] H. Ji, L.I.R. Ehrlich, J. Seita, P. Murakami, A. Doi, P. Lindau, H. Lee, M.J. Aryee, R.A. Irizarry, K. Kim, D.J. Rossi, M.A. Inlay, T. Serwold, H. Karsunky, L. Ho, G.Q. Daley, I.L. Weissman, and A.P. Feinberg. Comprehensive methylome map of lineage commitment from haematopoietic progenitors. Nature, 467(7313):338–342, 2010.
[28] P.A. Jones and S.B. Baylin. The epigenomics of cancer. Cell, 128(4):683–692, 2007.
[29] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
[30] T. Kouzarides. Chromatin modifications and their function. Cell, 128(4):693–705, 2007.
[31] P. W. Laird. Principles and challenges of genomewide DNA methylation analysis. Nature Reviews Genetics, 11(3):191–203, 2010.
[32] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memoryefficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.
[33] E. Li, T. H. Bestor, and R. Jaenisch. Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell, 69(6):915–926, 1992.
[34] R. Lister, R. C. O’Malley, J. TontiFilippini, B. D. Gregory, C. C. Berry, H. A. Millar, and J. R. Ecker. Highly integrated singlebase resolution maps of the epigenome in Arabidopsis. Cell, 133(3):523–536, 2008.
[35] R. Lister, M. Pelizzola, R.H. Dowen, R.D. Hawkins, G. Hon, J. TontiFilippini, J.R. Nery, L. Lee, Z. Ye, Q.M. Ngo, L. Edsall, J. AntosiewiczBourget, R. Stewart, V. Ruotti, A.H. Millar, J.A. Thomson, B. Ren, and J.R. Ecker. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462(7271):315–322, 2009.
[36] D.I. Martin, M. Singer, J. Dhahbi, G. Mao, L. Zhang, G. P. Schroth, L. Pachter, and D. Boffelli. Phyloepigenomic comparison of great apes reveals a correlation between somatic and germline methylation states. Genome Research, 21(12):2049–2057, 2011.
[37] A.K. Maunakea, R.P. Nagarajan, M. Bilenky, T.J. Ballinger, C. D’Souza, S.D. Fouse, B.E. Johnson, C. Hong, C. Nielsen, Y. Zhao, G. Turecki, A. Delaney, R. Varhol, N. Thiessen, K. Shchors, V.M. Heine, D.H. Rowitch, X. Xing, C. Fiore, M. Schillebeeckx, S.J.M. Jones, D. Haussler, M.A. Marra, M. Hirst, T. Wang, and J.F. Costello. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature, 466(7303):253–257, 2010.
[38] (p.385) A. Meissner, A. Gnirke, G.W. Bell, B. Ramsahoye, E.S. Lander, and R. Jaenisch. Reduced representation bisulfite sequencing for comparative highresolution DNA methylation analysis. Nucleic Acids Research, 33(18):5868–5877, 2005.
[39] A. Meissner, T. S. Mikkelsen, H. Gu, M. Wernig, J. Hanna, A. Sivachenko, X. Zhang, B. E. Bernstein, C. Nusbaum, D. B. Jaffe, A. Gnirke, R. Jaenisch, and E. S. Lander. Genomescale DNA methylation maps of pluripotent and differentiated cells. Nature, 454:766–770, 2008.
[40] F. Mohn, M. Weber, D. Schübeler, and T.C. Roloff. Methylated DNA Immunoprecipitation (MeDIP). In J. M. Walker and J. Tost, editors, DNA Methylation, volume 507 of Methods in Molecular Biology, pages 55–64. Humana Press, 2009.
[41] H.D. Morgan, H.G. Sutherland, D.I. Martin, and E. Whitelaw. Epigenetic inheritance at the agouti locus in the mouse. Nature Genetics, 23(3):314–318, 1999.
[42] L. Pachter. Models for transcript quantification from RNASeq. arXiv:1104.3889v2, 2011.
[43] C. A. Perry, C. D. Allis, and A. T. Annunziato. Parental nucleosomes segregated to newly replicated chromatin are underacetylated relative to those assembled de novo. Biochemistry, 32(49):13615–13623, 1993.
[44] V.K. Rakyan, S. Chong, M.E. Champ, P.C. Cuthbert, H.D. Morgan, K.V.K. Luu, and E. Whitelaw. Transgenerational inheritance of epigenetic states at the murine Axin(Fu) allele occurs after maternal and paternal transmission. Proceedings of the National Academy of Sciences of the United States of America, 100(5):2538–2543, 2003.
[45] V.K. Rakyan, T.A. Down, D.J. Balding, and S. Beck. Epigenomewide association studies for common human diseases. Nature Reviews Genetics, 12(8):529–541, 2011.
[46] K.D. Robertson. DNA methylation and human disease. Nature Reviews Genetics, 6(8):597–610, 2005.
[47] R.J. Schmitz, M.D. Schultz, M.G. Lewsey, R.C. O’Malley, M.A. Urich, O. Libiger, N.J. Schork, and J.R. Ecker. Transgenerational epigenetic instability is a source of novel methylation variants. Science, 334(6054):369–373, 2011.
[48] Y. Shufaro, O. LachamKaplan, B.Z. Tzuberi, J. McLaughlin, A. Trounson, H. Cedar, and B.E. Reubinoff. Reprogramming of DNA replication timing. Stem Cells, 28(3):443–449, 2010.
[49] M. Singer, D. Boffelli, J. Dhabhi, A. Schönhuth, G.P. Schroth, D.I. Martin, and L. Pachter. MetMap enables genomescale methyltyping for determining methylation states in populations. PLOS Computational Biology, 6(8):e1000888, 2010.
[50] M. Singer, A. Engström, A. Schönhuth, and L. Pachter. Determining coding CpG islands by identifying regions significant for pattern statistics on Markov chains. Statistical Applications in Genetics and Molecular Biology, 10(1):43, 2011.
[51] M. Slatkin. Epigenetic inheritance and the missing heritability problem. Genetics, 182(3):845–850, 2009.
[52] B. D. Strahl and C. D. Allis. The language of covalent histone modifications. Nature, 403(6765):41–45, 2000.
[53] R. Straussman, D. Nejman, D. Roberts, I. Steinfeld, B. Blum, N. Benvenisty, I. Simon, Z. Yakhini, and H. Cedar. Developmental programming of CpG island methylation profiles in the human genome. Nature Structural and Molecular Biology, 16(5):564–571, 2009.
[54] Y. Sujuan, A. Asaithambi, and Y. Liu. CpGIF: an algorithm for the identification of CpG islands. Bioinformation, 2(8):335–338, 2008.
[55] M.M.M. Suzuki and A. Bird. DNA methylation landscapes: provocative insights from epigenomics. Nature Reviews Genetics, 9:465–476, 2008.
[56] D. Takai and P.A. Jones. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proceedings of the National Academy of Sciences of the United States of America, 99(6):3740–3745, 2002.
[57] B.M. Turner. Defining an epigenetic code. Nature Cell Biology, 9(1):2–6, 2007.
[58] J.G. Underwood, A.V. Uzilov, S. Katzman, C.S. Onodera, J.E. Mainzer, D.H. Mathews, T.M. Lowe, S.R. Salama, and D. Haussler. FragSeq: transcriptomewide RNA structure probing using highthroughput sequencing. Nature Methods, 7(12):995–1001, 2010.
[59] (p.386) R.Y. Wang, C.W. Gehrke, and M. Ehrlich. Comparison of bisulfite modification of 5methyldeoxycytidine and deoxycytidine residues. Nucleic Acids Research, 8(20):4777–4790, 1980.
[60] M. Weber, I. Hellmann, M.B. Stadler, L. Ramos, S. Pääbo, M. Rebhan, and D. Schübeler. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nature Genetics, 39(4):457–466, 2007.
[61] H. Wu, B. Caffo, H.A. Jaffee, R.A. Irizarry, and A.P. Feinberg. Redefining CpG islands using Hidden Markov Models. Biostatistics, 11(3):499–514, 2010.
[62] A. Zemach, I.E. McDaniel, P. Silva, and D. Zilberman. Genomewide evolutionary analysis of eukaryotic DNA methylation. Science, 328(5980):916–919, 2010.
[63] X. Zhang, J. Yazaki, A. Sundaresan, S. Cokus, S.W.L. Chan, H. Chen, I.R. Henderson, P. Shinn, M. Pellegrini, S.E. Jacobsen, and J.R. Ecker. Genomewide highresolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell, 126(6):1189–1201, 2006.
[64] Y. Zhang, C. Rohde, S. Tierling, T.P. Jurkowski, C. Bock, D. Santacruz, S. Ragozin, R. Reinhardt, M. Groth, J. Walter, and A. Jeltsch. DNA methylation analysis of chromosome 21 gene promoters at single base pair and single allele resolution. PLOS Genetics, 5(3):e1000438, 2009.
Notes:
(^{1}) The number of occurrences of fragment f_{i} in the solution follows a binomial distribution $B(wq,{\mathrm{\mu}}_{i})$, for which ${N}(wq{\mathrm{\mu}}_{i},wq{\mathrm{\mu}}_{i}(1{\mathrm{\mu}}_{i}))$ is a good approximation. Therefore, the proportion of samples from which f_{i} was generated can be approximated by the distribution ${N}\left({\mathrm{\mu}}_{i},{\displaystyle \frac{{\mathrm{\mu}}_{i}(1{\mathrm{\mu}}_{i})}{wq}}\right)$.