Print publication date: 2014

Print ISBN-13: 9780198709022

Published to Oxford Scholarship Online: December 2014

DOI: 10.1093/acprof:oso/9780198709022.001.0001

Show Summary Details
Page of

PRINTED FROM OXFORD SCHOLARSHIP ONLINE (www.oxfordscholarship.com). (c) Copyright Oxford University Press, 2018. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a monograph in OSO for personal use (for details see www.oxfordscholarship.com/page/privacy-policy). Subscriber: null; date: 18 January 2019

Bayesian Networks in the Study of Genome-wide DNA Methylation

Chapter:
(p.363) Chapter 14 Bayesian Networks in the Study of Genome-wide DNA Methylation
Source:
Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics
Publisher:
Oxford University Press
DOI:10.1093/acprof:oso/9780198709022.003.0014

Abstract and Keywords

This chapter explores the use of Bayesian networks in the study of genome-scale deoxyribonucleic acid (DNA) methylation. It begins by describing different experimental methods for the genome-scale annotation of DNA methylation. The Methyl-seq protocol is detailed and the biases induced by this technique are depicted, which constitute as many challenges for further analysis. These challenges are addressed introducing a Bayesian network framework for the analysis of Methyl-seq data. This previous model is extended to incorporate more information from the genomic sequence. Genomic structure is used as a prior on methylation status. A recurring theme is the interplay between the model used to glean information from the technology, and the view of methylation that drives the model specification. Finally, a study is described, in which such models were used, leading to both interesting biological conclusions and to insights about the nature of methylation.

Keywords:   DNA methylation, Methyl-seq, Bayesian networks

Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics. First Edition. Christine Sinoquet & Raphaël Mourad (Eds).© Oxford University Press 2014. Published in 2014 by Oxford University Press.

This chapter explores the use of Bayesian networks in the study of genome-scale DNA methylation. It begins by describing different experimental methods for the genome-scale annotation of DNA methylation. The methyl-Seq method is detailed, and the biases induced by this technique, which constitute as many challenges for further analysis are depicted. These challenges are addressed introducing a Bayesian network framework for the analysis of methyl-Seq data. This previous model is extended to incorporate more information from the genomic sequence. Genomic structure is used as a prior on methylation status: unmethylated sites tend to cluster, and unmethylated sites are more conserved than other sites. A recurring theme is the interplay between the model used to glean information from the technology, and the view of methylation that drives the model specification. Finally, a study is described in which such models were used, leading to both interesting biological conclusions and to insights about the nature of methylation.

(p.364) 14.1 Introduction to Epigenetics

Epigenetic mechanisms influence phenotype through heritable, but potentially reversible, regulation of gene expression. These mechanisms work at many levels including DNA methylation, histone modifications, nucleosome positioning, and replication timing [9]. All of these have been shown to be heritable across cell divisions and to differ across different tissues and cell types [14, 18, 19, 21, 43, 48]. There is a significant body of ongoing research on each of these different mechanisms, and in this section we summarize two of the most studied epigenetic features: DNA methylation and histone modifications.

DNA methylation DNA methylation relates to the covalent attachment of a methyl group to a cytosine (C) nucleotide, in place of the hydrogen atom on the 5 position on the pyrimidine ring, or to the addition of a methyl group to an adenine (A). Adenine methylation has been found only in bacterial genomes [18], and the term “DNA methylation” usually, and throughout this chapter, refers to cytosine methylation. Major events in the cell, such as cell differentiation, X-chromosome inactivation and retrotransposon silencing to name a few, are characterized by DNA methylation, and in mice it has been shown that inhibition of the methylation regulating enzymes results in embryonic lethality [33, 55]. Broadly speaking, DNA methylation is associated with a heterochromatin state and silencing of transcription. In vertebrates, DNA methylation is restricted mostly to CpG sites (Cs that are followed by guanines (Gs)). Most of the vertebrate genome is methylated, and the unmethylated sites tend to cluster together along the genome [4, 55]. Many such unmethylated clusters occur at promoter regions, and their methylation is associated with the silencing of the nearby gene [60].

Histone modification The N-terminal tails of histones are subject to various types of post-translational covalent modifications, including lysine and arginine methylation, lysine acetylation, ubiquitination, and serine phosphorylation [30]. A histone may harbor a number of different modifications at a given time, giving rise to many possible configurations of modifications, sometimes related to as a histone code [52, 57]. Histone modifications can influence the packing assembly of the DNA by moderating a histone’s DNA-binding affinity and by recruiting further chromatin remodeling complexes [30]. By doing so, these modifications can affect gene expression; some modifications are associated with transcriptional repression, and others are associated with transcriptional activation. Both types of modification can be present at either promoter or intragenic regions [8].

Epigenetic phenomena can affect gene regulation and be inherited between cell divisions through a mechanism different than that by which they were initiated. This results in an ability to maintain a regulatory program across cell divisions, enabling a form of cell “memory.” Epigenetics is therefore at the forefront of cell differentiation studies [10, 27, 39], and numerous epigenetic mutations have been associated with various diseases [46]. Specifically, many studies have reported association between altered methylation states and various cancers [3, 13, 28].

Another active field of research concerns deciphering the affects of environmental factors on epigenetic states, along with the extent to which changes in epigenetic states are stochastic. For example, it has been shown that monozygotic twins accumulate differences in DNA methylation throughout their lives, and that the accumulated differences affect their gene expression portrait [15]. Another interesting study has showed that prenatal tobacco smoke exposure affects DNA methylation in the fetus [7].

(p.365) Transgenerational inheritance of DNA methylation has been observed in several loci in mice [6, 41, 44]. In plants, changes in DNA methylation that are inherited across generations have been shown to be rather frequent and seem to be a common way of adapting gene regulation to a changing environment [20, 23, 47]. The implications of these phenomena for theories of inheritance and evolution are currently being explored [51].

The genesis of high-throughput sequencing technologies (also referred to as “next-generation” sequencing) has enabled the study of epigenetic features on a genome-wide scale, giving rise to new fields of study. Genome-wide data sets allow for the characterization of epigenetic phenomena across all known genes and allow for broad comparison studies between genes, tissues, individuals, and species [26, 27, 35, 36, 62]. The presence of genome-scale epigenetic data enables association studies to look at both genomic and epigenomic features in the search for causal variants of disease, and the development of methods for epigenome-wide association studies is underway [45].

In this chapter, we explore the use of Bayesian networks in the study of genome-scale DNA methylation. We begin by describing different experimental methods for the genome-scale annotation of DNA methylation in Section 14.2 and focus on the challenges present in the analysis of a specific method, called methyl-Seq. In Section 14.3, we introduce a Bayesian network for the analysis of methyl-Seq data, and in Section 14.4, we extend this model to incorporate more information from the genomic sequence. A recurring theme is the interplay between the model used to glean information from the technology, and the view of methylation that drives the model specification. We describe a study in which such models were used in Section 14.5, and show in Section 14.6 how the models led to both interesting biological conclusions and to insights about the nature of methylation.

14.2 Next-generation Sequencing and DNA Methylation

The transformative impact of high-throughput sequencing on many fields of biology cannot be overstated. The ability to sequence many short DNA fragments at a low cost per base pair (bp), motivated initially by the goal of lowering the costs of generating the DNA sequence of an individual or species, has found unexpected applications in molecular biology. One of the fields in which sequencing technologies have initiated a revolution over the past few years is epigenetics, and specifically DNA methylation. In this section, we describe the different ways in which high-throughput sequencing is used to study DNA methylation, and explain why there is a relatively large number of methods in use today. Throughout this section we focus on the possibilities and challenges of studying DNA methylation through the use of high-throughput sequencing. We will first outline the different techniques that enable measurement of DNA methylation on a genome-wide scale, and then describe the main methods that make use of these techniques, explaining in detail the methyl-Seq protocol, which is the focus of this chapter.

There are several ways to detect DNA methylation, and a detailed description of the different techniques can be found in [31]. The major techniques currently available are illustrated in Fig. 14.1 (page 14.2.1) and include:

• Enzyme digestion: This technique uses restriction enzymes that in addition to being sequence-specific are methylation sensitive (cut only if their recognition site is unmethylated). A notable example is HpaII, which digests at CCGG sites at which the second cytosine is unmethylated (p.366) (in this chapter we denote a C that is followed by a guanine (G) as CpG, and, to ease notation, denote a pair of Cs followed by a pair of Gs as CCGG, rather than CpCpGpG). The pattern of digestion by such enzymes can provide a read-out of the DNA methylation. A disadvantage of this technique is the possibility of incomplete digestion and for some enzymes, the lack of site-specific resolution.

• Sodium bisulfite conversion: Treatment of DNA molecules with sodium bisulfite converts unmethylated Cs to uracils [24, 59]. Followed by a PCR step, the uracils are sequenced as thymines (Ts). This procedure enables the conversion of an epigenetic mark into a modification in the genomic sequence. Disadvantages of this technique include a reduction in sequence complexity (see details below), the need to chemically treat the DNA which can result in degradation, and the possibility of incomplete conversion.

• Affinity enrichment: In this process, the genome is digested, and the digest is enriched for methylated regions. This is achieved through the recognition of methylated regions by either antibodies [40] or methyl-CpG binding domain proteins [63] and the separation of DNA regions that were bound from those that were not. Disadvantages of this technique include binding biases of the enzymes and the lack of site-specific resolution.

Fig. 14.1 Three common techniques for genome-scale annotation of DNA methylation. (A) Enzyme digestion: the genomic DNA is digested with a methylation-sensitive restriction enzyme such as HpaII, which digests unmethylated CCGG sites. (B) Bisulfite conversion: converts cytosines that are not methylated to uracil. (C) Affinity enrichment: methylated cytosines in methylated regions are bound by antibodies or methyl-CpG binding proteins. M denotes methylation site. Reproduced in color in the color plate section.

Fig 14.1 Three common techniques for genome-scale annotation of DNA methylation. (A) Enzyme digestion: the genomic DNA is digested with a methylation-sensitive restriction enzyme such as HpaII, which digests unmethylated CCGG sites. (B) Bisulfite conversion: converts cytosines that are not methylated to uracil. (C) Affinity enrichment: methylated cytosines in methylated regions are bound by antibodies or methyl-CpG binding proteins. M denotes methylation site.

14.2.1 Assaying Genome-wide DNA Methylation

Methods for genome-wide measurement of DNA methylation generally use one of the aforementioned techniques coupled with either high-throughput sequencing or array hybridization. Potentially, methods may incorporate more than one of the techniques, but current approaches do not do so. These methods can be broadly classified as methyltyping versus methylome (p.367) sequencing, in analogy with genotyping versus genome sequencing for DNA. In genotyping, only a small subset of an individual’s nucleotides are assayed (SNP locations), while in genome sequencing the whole genome of an individual is sequenced. The advantage of genotyping is in its significantly lower cost compared to whole genome sequencing, allowing for the inclusion of many more individuals in an experiment of a fixed cost. Similarly, methyltyping technologies allow surveying of genome-scale methylation patterns by sampling a subset of CpG sites, and emphasize low cost at the expense of high resolution. In this section, we discuss different methods for methylome sequencing and methyltyping, along with the advantages and disadvantages associated with each method.

The different methods used for high-throughput genomics, including those for measuring DNA methylation, reduce to the ability of counting DNA fragments in a digest, obtained using either arrays or sequencing. Arrays are considered less accurate than sequencing for quantification purposes, mainly due to biases introduced in arrays by the variability in hybridization. Each of the techniques described can be followed by array hybridization, but affinity enrichment is by far the technique that is best suited for arrays [31]. The advantage of using arrays is the low cost, but disadvantages include the biases introduced in the hybridization step, as well as genome-scale array-based methods not being site-specific (there are array-based methods that use either enzyme digestion or bisulfite conversion and are site-specific, but since they sample a considerably smaller number of CpGs, we do not consider them as genome-scale methyltyping methods).

Advances in high-throughput sequencing in the past several years have brought about a large increase in the number of available methods for mapping DNA methylation on a genome-wide scale. Whole-genome bisulfite sequencing (WGBS) involves random digestion of the genome followed by bisulfite treatment, amplification, and sequencing, offering the ability to measure absolute levels of DNA methylation at a single-nucleotide resolution. While this procedure has been used in different studies [11, 34, 35, 62], its use is limited as it is the most expensive method for DNA methylation annotation, because it requires the sequencing of whole genomes. Moreover, sequencing a whole methylome is considerably more expensive than sequencing a whole genome because of the continuous nature of the methylation phenomenon (we would like to determine the proportion of cells in the digest in which a cytosine was methylated), requiring significantly higher coverage than genome sequencing. Therefore, this method cannot, in the foreseeable future, be used for large-scale comparison studies such as population studies and epigenetic association studies. Other disadvantages of the method that are not present in whole-genome sequencing include biases introduced in the polymerase chain reaction (PCR) amplification step and a reduction in sequence complexity. The PCR amplification step introduces biases that are due to unmethylated instances introducing AT-rich sequences. The sequence complexity is reduced because every T sequenced in a read could be mapped to either a T or a C in the reference genome, reducing the number of locations producing uniquely mapping reads. A second method for annotating methylation across the whole genome is that of affinity enrichment followed by high-throughput sequencing. This method has been used in [37], but requires extensive sequencing, is not site-specific, and is prone to binding biases.

The use of high-throughput sequencing has enabled several methods for methyltyping, two of which are discussed here: methyl-Seq and Reduced representation bisulfite sequencing (RRBS). In methyl-Seq the genome is digested with the methylation sensitive restriction enzyme HpaII that digests unmethylated CpGs that are within CCGG sites. This is followed by size selection of the fragments, a PCR amplification step, and sequencing the fragments’ ends. Mapping (p.368) sequenced reads back to a reference genome reveals unmethylated CpGs. methyl-Seq is a convenient methyltyping strategy because it is cost-effective due to the sequencing of only unmethylated sites (the minority of CpG sites in vertebrates [55]), requires only small amounts of material, and avoids bisulfite conversion. However, although the experiment is relatively simple, interpretation of the sequencing data is not straightforward due to the protocol resulting in a non-random segmentation of the genome which is followed by a size selection step. We describe the methyl-Seq protocol and the bias that is associated with it in greater detail in the next section.

A second methyltyping method, RRBS, is based on digestion with a methylation-insensitive enzyme followed by bisulfite sequencing [38]. The procedure is enzyme digestion followed by size selection of the fragments, bisulfite treatment, PCR amplification, and sequencing of the fragments’ ends. The digestion is commonly performed with MspI [22, 39], which digests CCGG sites, to enrich for regions of the genome that are rich in CpG sites. RRBS is favorable due to being significantly less expensive to perform than whole-genome bisulfite sequencing, but suffers from the same disadvantages related to methods that use bisulfite conversion: bias introduced by the PCR amplification step and reduced sequence complexity.

When designing an experiment in which DNA methylation is measured on a genome-wide scale, one must consider the trade-off between the cost of the method and the type of coverage it produces. In addition to this, the application and analysis of the different methods are complicated by a number of other issues. The major complications of the different methods based on the technologies they use were discussed above. On top of this, the analysis requirements for different assays vary in difficulty, and the rapid development of the sequencing field calls for frequent reassessment of the comparison between methods. For these reasons, there has been a proliferation of methods whose pros and cons are constantly changing as sequencing technologies change. In [31], it was suggested that methyl-Seq is the method with the most favorable profile of pros and cons, with respect to the measures chosen for comparison (see Table 2 of [31]). We compare here the different methods presented, given the current stage of the field, but the reader should keep in mind that the comparison criteria will change as the field develops and as new techniques are introduced.

Table 14.1 summarizes the main features by which methods are compared. It is divided into methyltyping methods and whole-genome methods. We have omitted a column for a cost comparison due to the changing nature of the field (sequencing costs are rapidly decreasing), but it seems certain that in the foreseeable future, whole-genome methods will require significantly more sequencing than methyl-Seq and RRBS and that RRBS will require more sequencing than methyl-Seq. Methyl-seq retrieves information spanning more of the genome than RRBS, because of a more favorable profile of fragment sizes produced by HpaII relative to MspI [49].

In the next section, we explain in detail the methyl-Seq experimental procedure and the need for a statistical model dedicated to the analysis of the data produced, motivating the subsequent sections in the chapter.

14.2.2 The methyl-Seq Method

In this section, we describe the methyl-Seq experiment and the experimental bias it introduces. In the methyl-Seq experiment, the genomic DNA is digested with a methylation-sensitive restriction enzyme; in our case, HpaII. HpaII digests at CCGG sites in which the second C is unmethylated. This step obtains a digest of DNA fragments such that at all of the fragment ends there is an unmethylated HpaII site, and all HpaII sites within a fragment (not including the ends) are (p.369) methylated (assuming complete digestion). Then, the fragments are size-selected using a run on an agarose gel, where the common length accepted is between 50 and 300 bps. This size selection is important to achieve good sequencing throughput when using Illumina machines. A paired-end library is constructed from the fragments that pass the size-selection step and, after a PCR amplification step, the ends of the fragments are sequenced. The paired sequenced reads are then mapped to a reference genome using an aligner that supports paired-end reads, such as Bowtie [32]. Notice that by constructing a paired-end library, the sequencing experiment returns pairs of sequences, where each pair is the product of sequencing the ends of one fragment. Using this protocol one can conclude which fragments where present in the digest.

Table 14.1 A summary of the characteristics of commonly used methods for methyltyping (top three) and whole methylome annotation (bottom two) in humans. For further information on the derivation of the values in the table see [49] and [31].

Pre-selected

Coverage of

Coverage of

# CpGs

Analysis

Site-specific

Regions

Human Genome

CpG Islands

Sampled

Challenges

methyl-Seq

yes

no

9.2%

92.9 %

~1.4 M

inference procedure needed

RRBS

yes

no

8.1%

69.8 %

~1.4 M

low complexity + PCR bias

Affinity-based Array

no

yes

pre-selected

pre-selected

-

binding biases

WGBS

yes

no

whole genome

whole genome

~28 M

low complexity + PCR bias

Affinity-based Seq

no

no

whole genome

whole genome

~28 M

binding biases + array biases

Note: RRBS, reduced representation bisulfite sequencing; WGBS, whole-genome bisulfite sequencing.

After the digestion step has completed, all unmethylated HpaII sites are present at the ends of digested fragments (see Fig. 14.1), but the size selection step required by the sequencing protocol limits sequencing to fragments of a narrow size range. This results in many cases in which one cannot determine the extent to which a site is methylated based on its read counts alone. Fig. 14.2 (page 370) shows three different types of epialleles and the fragments generated by them that pass the size selection step and are sequenced. When determining the methylation state of the highlighted site, cases 2 and 3 are indistinguishable if one considers only fragments originating at that site: in both cases, there are no fragments sequenced that have the highlighted site at either end. However, in case 2 we see that the sites on both sides of the highlighted site are unmethylated, and therefore if the highlighted site had been methylated, we would have sequenced a fragment 60 bps in length corresponding to the fragment in the middle (as sequenced in case 3). Since such a fragment was not sequenced, we can conclude that the site is unmethylated. In case 3, since the highlighted site was present in the interior of a fragment, we can conclude that it was methylated. While the examples in this figure assume binary methylation states, when relating to a population (p.370) of cells, methylation states are assumed to be continuous variables, because the methylation states can be heterogeneous even within a population of a single cell type [1, 64]. It can easily be seen that the examples above can be generalized to the case of continuous variables, where one cannot determine the extent to which a site is methylated given the read counts at that site alone, but can do so when making use of the fragments generated from the site’s neighborhood.

14.3 A Bayesian network for methyl-Seq Analysis

We have seen in the previous section that while methyl-Seq is a favorable method for methyltyping, the bias present in the method needs to be corrected for. When performing a methyl-Seq experiment, we would like to gain knowledge of the extent to which each HpaII site is methylated. More precisely, we would like to know for each HpaII site the proportion of cells in the digest that were methylated at that site. What we observe, however, is the number of times each fragment was sequenced in the experiment. Using our knowledge of the experimental procedure, we take a generative approach to model the procedure by which the methylation states of the HpaII sites impact the number of times each fragment is sequenced. On the basis of this generative model, we can infer the expected methylation states of the HpaII sites, given the observed fragment counts.

In the next section, we discuss how the methyl-Seq experiment can be modeled as a Bayesian network, and how that Bayesian network can be used to infer the methylation extents for all HpaII sites given the fragment counts. We begin by introducing a generative model for our data that is based on a Bayesian network. We then describe how the model parameters can be learned using the expectation-maximization (EM) algorithm, and how we use this together with the observed fragment counts to infer the methylation states of the HpaII sites. In the following section, we (p.371) discuss how expanding the model by adding an additional set of variables can achieve better site-specific inferences, as well as infer the location of unmethylated clusters.

14.3.1 Notation

In this chapter, we will denote random variables by uppercase letters, and a specific assignment to a random variable by the corresponding lowercase letter. For example, three coin tosses can be modeled by the random variables X1,X2, and X3 for the first, second, and third toss, respectively, and an assignment to X1 is denoted by x1. We will denote sets of random variables by bold uppercase letters, and an assignment to the set of random variables by the corresponding bold lowercase letter. For example, we would denote by X the three random variables X1,X2, and X3, and by x an assignment to all three variables.

14.3.2 A Generative Model

In this section, we describe a generative model for the methyl-Seq experiment. Given a reference genome, let F be the set of genomic sequences that have a HpaII site at each end. Notice that these regions may also have HpaII sites within them. We note that although theoretically F is the set of fragments that may be present in the digest of a methyl-Seq experiment, some fragments have probability of being sequenced which is essentially zero, like fragments spanning whole chromosomes, but we disregard that now to ease notation. For every HpaII site, we assign a random variable ${Yi}i=1,…,N∈[0,1]$, where N is the number of HpaII sites in the reference genome. Yi represents the proportion of cells from the digest in which site i was methylated.

In our generative model, we assume that at each step a cell is drawn at random, and the methylation state of each HpaII site is determined as 1 (methylated) or 0 (unmethylated) using Bernoulli draws with probability yi (in the case of diploid genomes, this sampling is done twice, once for each of the chromosomes). The methylation assignment to all HpaII sites of the chromosome determines what fragments will be generated from this cell. Let ${Zi}i=1,…,|F|∈[0,1]$ be random variables for the proportion of instances from which fragment fi was generated. In other words, let ${Zi}i=1,…,|F|$ be random variables for the proportion of times that the methylation configuration of the HpaII sites of a chromosome generated fragment fi. In this generative approach, $zi∼Nμi,μi(1−μi)wq$ where $μi=(1−ys)(1−ye)Πym$ and ys and ye are the sites on the upstream and downstream boundaries of fragment fi, ym are the sites between ys and ye, q is the number of cells in the digest, and w is the number of chromosome copies in each cell (for human w=2).1

Let ${Xi}i=1,…,|F|$ be random variables for the number of times fragment fi was sequenced (the number of paired-end reads that were sequenced from fragment fi). In our generative model, Xi follows a Poisson process in which the expected number of reads to be sequenced is $λi=ziθli$, where $θli$ is a factor determined by the length of the fragment, and depends on the specific experiment technicalities, such as the total amount of sequencing in the experiment. We condition Xi on the length of fi since, as described in the previous section, the size selection step in the experiment is not precise, and moreover, fragments of different lengths have different probabilities of being (p.372) sequenced, due to technicalities of the sequencing machinery. Fig. 14.3 shows the dependences of our generative model in a graphical model representation.

Fig. 14.2 The methylation state of a site cannot always be determined from the number of fragments that originated at that site. In many cases, the methylation state of a site cannot be determined from the extent to which it was present at the end of sequenced fragments but can be determined by integrating sequencing data from its neighborhood. bp, base pair, M, methylated; U, unmethylated. Reproduced in color in the color plate section.

Fig 14.2 The methylation state of a site cannot always be determined from the number of fragments that originated at that site. In many cases, the methylation state of a site cannot be determined from the extent to which it was present at the end of sequenced fragments but can be determined by integrating sequencing data from its neighborhood. bp, base pair, M, methylated; U, unmethylated.

Fig. 14.3 A Bayesian network representation for the generative model. Variables that are observed in the methyl-Seq experiment are filled in gray.

We have made a few simplifying assumptions in this generative model because they greatly simplify the dependences among variables and result in a tractable model that is convenient to work with for inference purposes. The first simplifying assumption we make is that the methylation states of neighboring HpaII sites are independent. Second, in contrast to many generative models of RNA sequencing (RNA-seq) [42], in this model we assume Xi, the number of times fragment fi is sequenced, depends on the proportion of cells from which fi was generated (Zi), but not on the relative proportion to which fi is present in the digest $zi∑j=1|F|zj$. We allow ourselves to make this assumption in the methyl-Seq case because, in RNA-seq, the contribution of each fragment to the digest is bounded by the number of cells (each cell can contribute at most $w⋅fi$ fragments), making the differences between fragment frequencies much smaller than in RNA-seq experiments. This, together with the fact that in methyl-Seq the number of possible fragments (contributors) is much larger than the number of transcripts in RNA-seq experiments, brought us to relax the dependence on the relative proportion of fi in the solution.

14.3.3 Parameter Learning and Inference of Posterior Probabilities

Now that we have a model in place, recall that our objective is to infer the posterior probability for each Yi, given the observed X and L variables. We have described a generative model by which we assume the Xi’s are generated (and are dependent on the Yi’s). In this section, we will describe how we can use learning and inference algorithms to attain estimates of the Y variables given the experimental results. For this purpose we will make a few simplifying changes to our model: we (p.373) discretize the Y variables such that ${Yi}i=1,…,N∈{0.1,0.3,0.5,0.7,0.9}$, and we assume that the Z variables are determined deterministically by the Y variables, such that $zi=μi$.

To summarize, the list of variables, parameters and dependences in our model is as follows:

• Variables:

• $Y1,…,YN∈{0.1,0.3,0.5,0.7,0.9}$: for each HpaII site, the proportion of cells in the digest that are methylated.

• $Z1,…,Z|F|∈[0,1]$: the proportion of cells from which fragment fi was generated.

• $L1,…,L|F|∈{1,…,lmax}$: the length in bps of fragment fi.

• $X1,…,X|F|∈{1,…,xmax}$: the number of times fragment fi was sequenced.

Parameters:

• $θL=(θ1−20,θ21−40,…,θ381−400,θ≥400)$.

Dependence functions:

• $zi=(1−ysi)(1−yei)Πj=si+1ei−1yj$, where si and ei are the indexes of the locations at which fi begins and ends.

• $P(xi|zi,li)=λxiexp{−λ}xi!$, where $λ=zif(li)$, and $f:{1,…,lmax}→θL$ is a function that takes in li and returns θ‎se such that $s≤li≤e$.

In practice, fragments that are very short or very long relative to the boundaries of the size selection step have probability zero of being observed, and we can omit the corresponding variables from the model. This significantly reduces the number of variables in the model, and the complexity of parameter learning and inference procedures.

Although we listed ${Zi}i=1,…,|F|∈[0,1]$ for convenience, we note that when the Y variables are discrete, so are the Z variables. The number of variables a specific Zi is directly dependent on is determined by the number of HpaII sites that are on the fragment Zi represents. Due to the relatively small size-selection bounds (50–300 bps) and the HpaII restriction site being of length four (CCGG), the majority of Z variables are directly dependent on a small number of Y variables. This is important from practical reasons, because when conducting the inference step in the EM approach (see below), the Bayesian network is moralized,2 and the running time of the inference procedure is dependent on the sizes of the cliques introduced (from a computational complexity perspective, the running time is exponential in the size of the largest clique). To avoid our moralized graph from encompassing large cliques we determine a priori a maximal number of parents we allow for any Zi variable. Variables with more parents than allowed are not incorporated into the model. While this results in ignoring data from fragments that encompass too many HpaII sites, it has a crucial impact on lowering the computational complexity of the inference procedure.

The direct output of the experiment is a list of paired sequenced reads. Each pair of reads is mapped to a reference genome to determine which genomic fragment it originated from (see SubSection 14.2.2). In practice, some of the pairs cannot be mapped uniquely to one fragment of the genome. To resolve this, one may add another layer to the model, taking into account the uncertainty regarding which fragment the reads originated from. However a simpler approach is to disregard from the model fragments that can generate pairs of reads that do not uniquely map.

(p.374) In the same manner that we have a variable determining the length of each fragment, we can consider additional variables for characteristics that affect the extent of sequencing (such as GC content [2, 12]). However one must take into account that in the current setting, this will result in an exponential growth in the number of parameters.

The EM Approach

We would like to compute for each Yi its posterior distribution given the observed data. Namely, we are interested in $P(Yi|X,L)$ for all Yi variables. Given some parameter assignment to θ‎L, we can compute the posterior probabilities $P(Yi|X,L,θL)$ in time that is exponential in the size of the largest clique of the moralized graph, using the junction tree algorithm [29]. This computation however requires a parameter assignment, which we do not know.

On the other hand, if we were to observe a full assignment of all variables in the Bayesian network (Y,Z,L, and X), let such a data set be $D$, we could learn a set of parameters for our model using the maximum likelihood approach by finding the set of parameters that maximizes the likelihood function relative to $D$. We give an outline of these computations here, and leave the details for Appendix 14.8 (page 14.8). From the Bayesian network structure we can construct the likelihood function:

$Display mathematics$

assuming a uniform prior over Y and L and that the probability of $D$ is not zero (that all zi assignments are as expected given the yi assignments). The function f(li) is as defined in the “dependence function” annotation. The log-likelihood is therefore:

$Display mathematics$

where θ‎L(j) is the jth element of θ‎L and C is the part of the equation with elements that do not include instances from θ‎L. We would like to find θ‎L which maximizes the likelihood function. Each θ‎L(j) can be optimized separately, and by differentiating the log-likelihood, setting to zero and solving for θ‎L(j), we get the maximum likelihood parameters:

$Display mathematics$

To conclude, in the presence of a set of parameters we can compute the posterior distribution for the Ys, and in the presence of a fully observed data set, we can compute the maximum likelihood estimator for the parameters. This is a good setting to use the EM algorithm, which returns a parameter assignment for our model, given the observed data. Given an initial assignment of (arbitrary) parameters, the algorithm computes the expected values of the numerator and denominator used for the maximum likelihood parameter estimation. It then updates the parameters of the model using the expected values computed. This algorithm is guaranteed to converge to a (p.375) stationary point of the log-likelihood function. We start with some (possibly random) parameter assignment, $θL1$, and iterate between the following two steps:

• Expectation step (E-step): In this step the algorithm uses the parameters from the previous maximization step (M-step), $θLt$, to compute the expected values for the sufficient statistics used in the maximum likelihood parameter estimation. The expected values are computed for sufficient statistics that incorporate “hidden” (unobserved) variables of the model. The expected sufficient statistic that needs to be computed for θ‎L(j) is

$Display mathematics$

and can be computed using the posterior distributions obtained by the junction-tree algorithm on the moralized graph.

Maximization step (M-step): In this step, we use the expected sufficient statistics from the E-step to perform the maximum likelihood estimation:

$Display mathematics$

The algorithm iterates between the E-step and the M-step until the increase in the likelihood function is smaller than a given threshold. After the algorithm has converged at $θLf$, we can compute for each Yi its posterior distribution using a final run of the junction tree algorithm.

14.4 Genomic Structure as a Prior on Methylation Status

In the generative model described in the previous section, we did not assume any prior knowledge about the methylation states of different HpaII sites. However, it is well known that one can make use of the genomic sequence to gain some information regarding the probability that a specific HpaII site is methylated. This is because in vertebrates, unmethylated sites tend to cluster together and CpG sites that are constitutively unmethylated are more conserved than other CpG sites. This results in the effect that regions that tend to be unmethylated are more CpG rich than regions that tend to be methylated [4]. There are several methods to annotate such CpG-rich regions, based on genomic sequences, in an effort to find regions that have interesting DNA methylation behavior. Such regions are called “CpG islands”.

The precise definition of CpG island regions and their methylation states is a challenging task that has been the subject of much research throughout the years [4, 16, 17, 25, 50, 54, 53, 56, 61]. This is not surprising if we recall that DNA methylation states are more dynamic than the DNA sequence. For example, DNA methylation states may be different across the different tissues of (p.376) an individual [26, 53]. Thus, a purely sequence based definition of a “CpG island” is problematic. Nonetheless, in the case of experimental annotation of unmethylated sites, clusters of CpGs provide evidence against methylation that should ideally be incorporated into our model.

We therefore add a hidden random variable for each HpaII site, ${Wi}i=1,…,N∈{1,0}$, indicating whether the ith HpaII site is in an unmethylated cluster or not. We note that in our setting we assume that there are unmethylated clusters in the experiment, and the W variables denote the presence of a HpaII site in such a cluster. Different methyl-Seq experiments, on different tissues for example, may have a different set of unmethylated clusters. In addition, we add for each HpaII site two observed variables, denoted by C and D. The notation ${Ci}i=1,…,N∈{1,…,cmax}$ indicates the number of CpG sites between the (i−1)th HpaII site and the ith HpaII site (any CpGs that are not part of a CCGG site). The notation ${Di}i=1,…,N∈{1,…,dmax}$ indicates the distance in bps between the (i−1)th HpaII site and the ith HpaII site. A graphical model representation of the model augmented with these additions can be seen in Fig. 14.4. The list of variables, parameters, and dependence functions added to our model is the following:

Fig. 14.4 A Bayesian network representation for our generative model incorporating genomic structure. Variables that are observed in the methyl-Seq experiment are filled in gray.

• Variables:

• $W1,…,WN∈{1,0}$: indicates for HpaII site i whether it is present in an unmethylated island (1) or not (0).

• (p.377) $C1,…,CN∈{1,…cmax}$: the number of CpG sites between the previous and current HpaII sites.

• $D1,…,DN∈{1,…,dmax}$: the distance between the previous and current HpaII sites.

Parameters:

• $θY|W={θy0|w0,…,θy4|w0,θy0|w1,…,θy4|w1}$, \\ where $0≤θyi|wj≤1$, $∑iθyi|w0=1$ and $∑iθyi|w1=1$.

• $θW|W−1,C,D={θwa|w−1b,c(lC,hC),d(lD,hD)}$ where \ $a,b∈{0,1},$ \ $(lC,hC)∈{(0,2),…,(2k−1+1,2k),…,(513,1024),(≥1025)}$, \ $(lD,hD)∈{(0,5),…,(5⋅2k−1+1,5⋅2k),…,(1280,2560),(≥2561)}$, \\ and w−1 denotes the states of the W variable that is the parent of the W variable at hand. \ For all indexes, $0≤θwa|w−1b,c(lC,hC),d(lD,hD),≤1$, and \ $∑t={0,1}θwt|w−1b,c(lC,hC),d(lD,hD)=1$.

Dependence functions:

• All of the new dependence functions are fully described by the parameters added. The way in which every variable is dependent on its parents is fully captured by the parameters specified.

Applying the EM Algorithm

In order to make inferences from this model, for the same reasons discussed in the previous section, we make use of the EM algorithm. Assuming we have a fully observed data set, we can use the likelihood function as in the previous section to derive the maximum likelihood estimators for our model parameters:

$Display mathematics$
$Display mathematics$

and

$Display mathematics$

where $f1:{1,…cmax}→(lC,hC)$ takes in ci and returns a pair, $(lC,hC)$, such that $lC≤ci≤hC$, and $f2:{1,…,dmax}→(lD,hD)$ takes in di and returns a pair, (lD,hD), such that $lD≤di≤hD$.

As before, we start with some (possibly random) assignment of the parameters, and iterate between the following two steps until we determine convergence. For convenience, we denote by θ‎ the complete parameter set for our model.

• Expectation step (E-step): In this step, the algorithm uses the parameters from the last M-step, θ‎t, to compute the expected values for the sufficient statistics used in the maximum likelihood parameter estimation (that incorporate “hidden” variables). For $θL(i)$, this is done as previously described. Let O = (X, L, C, D) be the set of observed variables, and H = (W, Y, Z) be the set of hidden (p.378) variables. Using the junction-tree algorithm on the moralized graph, we can compute the probability of any assignment to some variable, Q, and its parents, U, given the observed data and θ‎t. In other words, after a run of the junction-tree algorithm, we know $P(q,u|O,θt)$ for every assignment of Q and U, (q u). We can then compute the needed expected sufficient statistics. For $θya|wb$ we compute

$Display mathematics$

and

$Display mathematics$
For $θwa|w−1b,c(lC,hC),d(lD,hD)$ we compute
$Display mathematics$

and

$Display mathematics$

We recall that the running time complexity of the junction-tree algorithm is exponential in the size of the maximum sized clique of the moralized graph, and it is easy to see how the moralized network can be chorded such that, aside from the cliques generated in the previous model, only cliques of size three are introduced.

• Maximization step (M-step): In this step we generate an updated set of parameters, $θt+1$, by using the expected sufficient statistics computed in the E-step. The computation of $θLt+1$ remains the same as before, and $θya|wbt+1$ and $θwa|w−1b,dlD,hD,clC,hCt+1$ are updated as follows:

$Display mathematics$

(p.379) and

$Display mathematics$

The algorithm iterates between the E-step and the M-step until the increase in the likelihood function is smaller than a given threshold. After the algorithm has converged at θ‎f, we can compute the posterior distribution for each Yi by using a final run of the junction tree algorithm. We can now also compute for every Wi the probability that the HpaII site at index i is in an unmethylated cluster. We elaborate on this in the next section.

14.5 Application: Methyltyping the Human Neutrophil

In the previous sections, we have described a Bayesian network for the correction of bias introduced in high-throughput sequencing experiments in which the genome is digested in a non-random manner followed by a size-selection step, and have focused on the methyl-Seq experiment. In this chapter, we present in detail results from a study that has applied such a model for the purpose of characterizing the methyltypes of neutrophil cells from four human individuals [49].

In [49], the authors show that the methyltyping approach applied by methyl-Seq coupled with a Bayesian network correction procedure, called MetMap, is sufficient to survey methylation states across the genome and provides significant insight into the methylome, inside and outside of CpG islands, at site-specific resolution. In the study at hand, the methyltypes of four male human individuals were determined, using the standard methyl-Seq protocol. Site-specific methylation values could be assigned to 4.8% of the CpGs in the human reference genome, and of these, 20% were inside CpG islands.

The authors show that the precision achieved by using MetMap’s correction was greatly increased over that of using the raw read counts. This was shown by determining the true methylation state of 46 HpaII sites using direct bisulfite sequencing and observing that the Pearson correlation of the MetMap inferred scores with the bisulfite validations, which was 0.90, was significantly higher than the Pearson correlation of the methylation values estimated by the raw fragment counts with the bisulfite validations, which was 0.67.

14.5.1 Unmethylated Clusters

As we have discussed throughout this chapter, unmethylated sites in vertebrate genomes tend to be clustered together. While this enables predicting locations of such clusters from the genomic sequence, if experimental methylation data are available, it is desirable to annotate unmethylated islands directly. This is because DNA methylation varies across tissues and conditions, and experimental data will reveal a more precise picture of the methylation state for the specific experiment.

In [49], the authors identified experiment-specific unmethylated islands, and called them SUMIs (strongly unmethylated islands). These unmethylated clusters are specific to the (p.380) experiment at hand, and therefore to the human neutrophil. An interesting test was to investigate how many of the SUMIs found in the neutrophil experiment were also present in the regions inferred as unmethylated by sequence-based methods (see Fig. 14.5). The authors compared the 20 986 SUMIs they found with regions inferred to be unmethylated by two sequence-based methods: CpG islands found using the University of California, Santa Cruz (UCSC) browser (termed UCSC islands) and “bona fide” CpG islands (termed BF islands) [5]. The UCSC islands are the sequence-based set of islands that are most commonly used, while the BF islands were annotated using a more sophisticated, and potentially more exact, method. Of the 20 986 SUMIs present in at least one of the four individuals, 4 652 do not overlap the UCSC islands, and 7 055 do not overlap the BF islands.

Fig. 14.5 A section of the genome showing site-specific methylation scores (top panel) and unmethylated clusters (SUMIs, second panel) as inferred from one of human neutrophil samples. For the site-specific scores, a score of 0 determines a site as fully methylated. The third and fourth panels show BF islands as annotated by [5] and UCSC islands, respectively. While there is substantial overlap between SUMIs and the islands inferred by the sequence-based methods, a few novel SUMIs are seen in this figure, one of them at a transcription start site. RefSeq denotes genes annotated in the National Center for Biotechnology Information reference sequence database. Reproduced in color in the color plate section.

Fig 14.5 A section of the genome showing site-specific methylation scores (top panel) and unmethylated clusters (SUMIs, second panel) as inferred from one of human neutrophil samples. For the site-specific scores, a score of 0 determines a site as fully methylated. The third and fourth panels show BF islands as annotated by [5] and UCSC islands, respectively. While there is substantial overlap between SUMIs and the islands inferred by the sequence-based methods, a few novel SUMIs are seen in this figure, one of them at a transcription start site. RefSeq denotes genes annotated in the National Center for Biotechnology Information reference sequence database.

The authors could validate the sensitivity and specificity of the SUMI set, compared to the sequence-based methods, by using direct bisulfite sequencing. Using this method, it was possible to annotate the methylation states of all the CpG sites at stretches of several hundred bps. As a validation of specificity, five regions that were annotated as part of both a UCSC island and a BF island, and did not overlap with SUMIs, were bisulfite sequenced. All regions were found to be methylated in the neutrophil sample. As a validation of sensitivity, four regions that overlapped SUMIs and UCSC islands but not BF islands were bisulfite sequenced. All regions were found to be unmethylated in the neutrophil sample. These results demonstrated that the true methylation state in a specific tissue may be different than the methylation state inferred by the genome sequence alone, and perhaps of the methylation state present in other cell types.

Interestingly, 3 797 SUMIs did not overlap with BF islands or UCSC islands, revealing new regions that are unmethylated in neutrophil cells. These regions showed strong association with regions associated with other functional features, such as open chromatin, high sequence conservation, gene regions, and the 5ʹ end of genes. Table 14.2 displays the association of the different sets of islands with these features. The authors point out that the proportion of SUMIs that map at a distance from transcription start sites is larger for novel SUMIs than for all SUMIs, but that novel SUMIs have a degree of association with open chromatin similar to that observed for all SUMIs; this suggests that novel SUMIs may often represent distal regulatory sequences.

Table 14.2 Percentages of neutrophil strongly unmethylated islands, CpG islands annotated by the University of California, Santa Cruz browser, and bona fide CpG islands that overlap regions associated with functionality. For details on how the functional regions were determined, see [49].

SUMIs (%)

UCSC CpG Islands (%)

BF Islands (%)

Novel SUMIs (%)

Open chromatin

70.0%

52.9%

65.3%

61.0%

Conservation

71.1%

68.5%

76.2%

49.6%

Gene regions

76.9%

77.7%

79.7%

59.9%

TSS regions

59.8%

52.2%

61.4%

22.0%

Note: BF, bona fide; SUMI, strongly unmethylated islands; TSS, transcription start site; UCSC, University of California, Santa Cruz.

(p.381) 14.6 Conclusions

In this chapter, we have demonstrated how Bayesian networks can be utilized in the genome-scale study of DNA methylation. By designing a generative model that incorporates hidden variables for the biological states and observed variables for the generated experimental data we were able to predict the biological states, even when such inference was not straightforward.

The need for a Bayesian network comes from the experiment of interest (methyl-Seq) encompassing a non-random digestion of the genome followed by a size-selection step. Bayesian networks are suitable for such a case because they can be used to model the dependences between neighboring sites. The model presented can be adjusted to account for similar biases in several other methyltyping methods including the use of several different methylation-sensitive restriction enzymes. It is also expected to prove useful in other high-throughput sequencing experiments that are prone to such biases, for example some of the recent protocols developed for determining RNA secondary structure [58].

Acknowledgments

We thank Dario Boffelli and David I.K. Martin for many extensive discussions on DNA methylation. M.S. and L.P. were funded in part by NIH grant R01 HG006129-01.

Appendix 14.A Computations Used For The Em Approach

In this appendix, we give a detailed description of the computations involved in annotating the maximum likelihood parameters, given an assignment to all states of the Bayesian network shown in Fig. 14.3. Given some assignment ${y,z,l,x}$ to the variables Y Z L and X the Bayesian network implies the likelihood function:

(p.382)

$Display mathematics$

where Q is the constant for $Πj=1NP(yj)$, W is the constant for $Πi=1|F|P(li)$ and we assume that the probability of $D$ is not zero ($P(zi|y)=1$ for all Zi of $D$). The function f(li) is as defined in the “dependence function” annotation. Notice that in this setting, we are assuming a full assignment to all random variables of the model, and therefore terms that do not consider parameters can be referred to as constants. Our goal is to find the assignment of parameters that maximize the above likelihood function. We can do this by finding the parameter assignment that maximizes the log-likelihood:

$Display mathematics$

where θ‎L(k) is the kth element of θ‎L and C is the part of the equation with elements that do not include instances from θ‎L. We would like to find θ‎L that maximizes the likelihood function, which we denote by $θˆL$. We denote the jth element of $θˆL$ by $θˆL(j)$. For each j we can find $θˆL(j)$ by differentiating the log-likelihood, setting to zero and solving for θ‎L(j):

$Display mathematics$

Solving this for θ‎L(j), we get:

$Display mathematics$

(p.383) References

Bibliography references:

[1] M.P. Ball, J.B. Li, Y. Gao, J. Lee, E.M. Leproust, I. Park, B. Xie, G. Q. Daley, and G.M. Church. Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells. Nature Biotechnology, 27(4):361–368, 2009.

[2] Y. Benjamini and T.P. Speed. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, 40(10): e72, 2012.

[3] B.P. Berman, D.J. Weisenberger, J.F. Aman, T. Hinoue, Z. Ramjan, Y. Liu, H. Noushmehr, C.P.E. Lange, C.M. van Dijk, R.A.E.M. Tollenaar, D. Van Den Berg, and P.W. Laird. Regions of focal DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear lamina associated domains. Nature Genetics, 44(1):40–46, 2011.

[4] A.P. Bird. CpG-rich islands and the function of DNA methylation. Nature, 321(6067):209–213, 1986.

[5] C. Bock, J. Walter, M. Paulsen, and T. Lengauer. CpG island mapping by epigenome prediction. PLOS Computational Biology, 3(6):e110, 2007.

[6] J. Borgel, S. Guibert, Y. Li, H. Chiba, D. Schübeler, H. Sasaki, T. Forné, and M. Weber. Targets and dynamics of promoter DNA methylation during early mouse development. Nature Genetics, 42(12):1093–1100, 2010.

[7] C.V. Breton, H.M. Byun, M. Wenten, F. Pan, A. Yang, and F.D. Gilliland. Prenatal tobacco smoke exposure affects global and gene-specific DNA methylation. American Journal of Respiratory and Critical Care Medicine, 180(5):462–467, 2009.

[8] E.I. Campos and D. Reinberg. Histones: annotating chromatin. Annual Review of Genetics, 43(1):559–599, 2009.

[9] H. Cedar and Y. Bergman. Epigenetics of haematopoietic cell development. Nature Reviews Immunology, 11(7):478–488, 2011.

[10] N.M. Cohen, V. Dighe, G. Landan, S. Reynisdóttir, A. Palsson, S. Mitalipov, and A. Tanay. DNA methylation programming and reprogramming in primate embryonic stem cells. Genome Research, 19(12):2193–2201, 2009.

[11] S.J. Cokus, S. Feng, X. Zhang, Z. Chen, B. Merriman, C.D. Haudenschild, S. Pradhan, S.F. Nelson, M. Pellegrini, and S.E. Jacobsen. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452:215–219, 2008.

[12] J.C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research, 36(16):e105, 2008.

[13] M. Ehrlich. DNA methylation in cancer: too much, but also too little. Oncogene, 21(35):5400–5413, 2002.

[14] J. Ernst, P. Kheradpour, T.S. Mikkelsen, N. Shoresh, L.D. Ward, C.B. Epstein, X. Zhang, L. Wang, R. Issner, M. Coyne, M. Ku, T. Durham, M. Kellis, and B.E. Bernstein. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345):43–49, 2011.

[15] M.F. Fraga, E. Ballestar, M.F. Paz, S. Ropero, F. Setien, M.L. Ballestar, D. Heine-Suner, J.C. Cigudosa, M. Urioste, J. Benitez, M. Boix-Chornet, A. Sanchez-Aguilera, C. Ling, E. Carlsson, P. Poulsen, A. Vaag, Z. Stephan, T.D. Spector, Y.Z. Wu, C. Plass, and M. Esteller. Epigenetic differences arise during the lifetime of monozygotic twins. Proceedings of the National Academy of Sciences of the United States of America, 102(30):10604–10609, 2005.

[16] M. Gardiner-Garden and M. Frommer. CpG islands in vertebrate genomes. Journal of Molecular Biology, 196(2):261–282, 1987.

[17] J.L. Glass, R.F. Thompson, B. Khulan, M.E. Figueroa, E.N. Olivier, E.J. Oakley, G. Van Zant, E.E. Bouhassira, A. Melnick, A. Golden, M.J. Fazzari, and J.M. Greally. CG dinucleotide clustering is a species-specific property of the genome. Nucleic Acids Research, 35(20):6798–6807, 2007.

[18] (p.384) M.G. Goll and T.H. Bestor. Eukaryotic cytosine methyltransferases. Annual Review of Biochemistry, 74(1):481–514, 2005.

[19] A. Goren and H. Cedar. Replicating by the clock. Nature Reviews Molecular Cell Biology, 4(1):25–32, 2003.

[20] R.T. Grant-Downton and H.G. Dickinson. Epigenetics and its implications for plant biology 2. The ‘epigenetic epiphany’: epigenetics, evolution and beyond. Annals of Botany, 97(1):11–27, 2006.

[21] S.I.S. Grewal and D. Moazed. Heterochromatin and epigenetic control of gene expression. Science, 301(5634):798–802, 2003.

[22] H. Gu, C. Bock, T.S. Mikkelsen, N. Jager, Z.D. Smith, E. Tomazou, A. Gnirke, E.S. Lander, and A. Meissner. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature Methods, 7(2):133–136, 2010.

[23] I.R. Henderson and S.E. Jacobsen. Epigenetic inheritance in plants. Nature, 447(7143):418–424, 2007.

[24] H. Hikoya. Discovery of bisulfite-mediated cytosine conversion to uracil, the key reaction for DNA methylation analysis—a personal account. Proceedings of the Japan Academy, Series B, 84(8):321–330, 2008.

[25] F. Hsieh, S.C. Chen, and K. Pollard. A nearly exhaustive search for CpG islands on whole chromosomes. The International Journal of Biostatistics, 5(1):14, 2009.

[26] R. Illingworth, A. Kerr, D. Desousa, H. Jørgensen, P. Ellis, J. Stalker, D. Jackson, C. Clee, R. Plumb, J. Rogers, S. Humphray, T. Cox, C. Langford, and A. Bird. A novel CpG island set identifies tissue-specific methylation at developmental gene loci. PLOS Biology, 6(1):e22, 2008.

[27] H. Ji, L.I.R. Ehrlich, J. Seita, P. Murakami, A. Doi, P. Lindau, H. Lee, M.J. Aryee, R.A. Irizarry, K. Kim, D.J. Rossi, M.A. Inlay, T. Serwold, H. Karsunky, L. Ho, G.Q. Daley, I.L. Weissman, and A.P. Feinberg. Comprehensive methylome map of lineage commitment from haematopoietic progenitors. Nature, 467(7313):338–342, 2010.

[28] P.A. Jones and S.B. Baylin. The epigenomics of cancer. Cell, 128(4):683–692, 2007.

[29] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

[30] T. Kouzarides. Chromatin modifications and their function. Cell, 128(4):693–705, 2007.

[31] P. W. Laird. Principles and challenges of genome-wide DNA methylation analysis. Nature Reviews Genetics, 11(3):191–203, 2010.

[32] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.

[33] E. Li, T. H. Bestor, and R. Jaenisch. Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell, 69(6):915–926, 1992.

[34] R. Lister, R. C. O’Malley, J. Tonti-Filippini, B. D. Gregory, C. C. Berry, H. A. Millar, and J. R. Ecker. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133(3):523–536, 2008.

[35] R. Lister, M. Pelizzola, R.H. Dowen, R.D. Hawkins, G. Hon, J. Tonti-Filippini, J.R. Nery, L. Lee, Z. Ye, Q.M. Ngo, L. Edsall, J. Antosiewicz-Bourget, R. Stewart, V. Ruotti, A.H. Millar, J.A. Thomson, B. Ren, and J.R. Ecker. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462(7271):315–322, 2009.

[36] D.I. Martin, M. Singer, J. Dhahbi, G. Mao, L. Zhang, G. P. Schroth, L. Pachter, and D. Boffelli. Phyloepigenomic comparison of great apes reveals a correlation between somatic and germline methylation states. Genome Research, 21(12):2049–2057, 2011.

[37] A.K. Maunakea, R.P. Nagarajan, M. Bilenky, T.J. Ballinger, C. D’Souza, S.D. Fouse, B.E. Johnson, C. Hong, C. Nielsen, Y. Zhao, G. Turecki, A. Delaney, R. Varhol, N. Thiessen, K. Shchors, V.M. Heine, D.H. Rowitch, X. Xing, C. Fiore, M. Schillebeeckx, S.J.M. Jones, D. Haussler, M.A. Marra, M. Hirst, T. Wang, and J.F. Costello. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature, 466(7303):253–257, 2010.

[38] (p.385) A. Meissner, A. Gnirke, G.W. Bell, B. Ramsahoye, E.S. Lander, and R. Jaenisch. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Research, 33(18):5868–5877, 2005.

[39] A. Meissner, T. S. Mikkelsen, H. Gu, M. Wernig, J. Hanna, A. Sivachenko, X. Zhang, B. E. Bernstein, C. Nusbaum, D. B. Jaffe, A. Gnirke, R. Jaenisch, and E. S. Lander. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature, 454:766–770, 2008.

[40] F. Mohn, M. Weber, D. Schübeler, and T.C. Roloff. Methylated DNA Immunoprecipitation (MeDIP). In J. M. Walker and J. Tost, editors, DNA Methylation, volume 507 of Methods in Molecular Biology, pages 55–64. Humana Press, 2009.

[41] H.D. Morgan, H.G. Sutherland, D.I. Martin, and E. Whitelaw. Epigenetic inheritance at the agouti locus in the mouse. Nature Genetics, 23(3):314–318, 1999.

[42] L. Pachter. Models for transcript quantification from RNA-Seq. arXiv:1104.3889v2, 2011.

[43] C. A. Perry, C. D. Allis, and A. T. Annunziato. Parental nucleosomes segregated to newly replicated chromatin are underacetylated relative to those assembled de novo. Biochemistry, 32(49):13615–13623, 1993.

[44] V.K. Rakyan, S. Chong, M.E. Champ, P.C. Cuthbert, H.D. Morgan, K.V.K. Luu, and E. Whitelaw. Transgenerational inheritance of epigenetic states at the murine Axin(Fu) allele occurs after maternal and paternal transmission. Proceedings of the National Academy of Sciences of the United States of America, 100(5):2538–2543, 2003.

[45] V.K. Rakyan, T.A. Down, D.J. Balding, and S. Beck. Epigenome-wide association studies for common human diseases. Nature Reviews Genetics, 12(8):529–541, 2011.

[46] K.D. Robertson. DNA methylation and human disease. Nature Reviews Genetics, 6(8):597–610, 2005.

[47] R.J. Schmitz, M.D. Schultz, M.G. Lewsey, R.C. O’Malley, M.A. Urich, O. Libiger, N.J. Schork, and J.R. Ecker. Transgenerational epigenetic instability is a source of novel methylation variants. Science, 334(6054):369–373, 2011.

[48] Y. Shufaro, O. Lacham-Kaplan, B.Z. Tzuberi, J. McLaughlin, A. Trounson, H. Cedar, and B.E. Reubinoff. Reprogramming of DNA replication timing. Stem Cells, 28(3):443–449, 2010.

[49] M. Singer, D. Boffelli, J. Dhabhi, A. Schönhuth, G.P. Schroth, D.I. Martin, and L. Pachter. MetMap enables genome-scale methyltyping for determining methylation states in populations. PLOS Computational Biology, 6(8):e1000888, 2010.

[50] M. Singer, A. Engström, A. Schönhuth, and L. Pachter. Determining coding CpG islands by identifying regions significant for pattern statistics on Markov chains. Statistical Applications in Genetics and Molecular Biology, 10(1):43, 2011.

[51] M. Slatkin. Epigenetic inheritance and the missing heritability problem. Genetics, 182(3):845–850, 2009.

[52] B. D. Strahl and C. D. Allis. The language of covalent histone modifications. Nature, 403(6765):41–45, 2000.

[53] R. Straussman, D. Nejman, D. Roberts, I. Steinfeld, B. Blum, N. Benvenisty, I. Simon, Z. Yakhini, and H. Cedar. Developmental programming of CpG island methylation profiles in the human genome. Nature Structural and Molecular Biology, 16(5):564–571, 2009.

[54] Y. Sujuan, A. Asaithambi, and Y. Liu. CpGIF: an algorithm for the identification of CpG islands. Bioinformation, 2(8):335–338, 2008.

[55] M.M.M. Suzuki and A. Bird. DNA methylation landscapes: provocative insights from epigenomics. Nature Reviews Genetics, 9:465–476, 2008.

[56] D. Takai and P.A. Jones. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proceedings of the National Academy of Sciences of the United States of America, 99(6):3740–3745, 2002.

[57] B.M. Turner. Defining an epigenetic code. Nature Cell Biology, 9(1):2–6, 2007.

[58] J.G. Underwood, A.V. Uzilov, S. Katzman, C.S. Onodera, J.E. Mainzer, D.H. Mathews, T.M. Lowe, S.R. Salama, and D. Haussler. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nature Methods, 7(12):995–1001, 2010.

[59] (p.386) R.Y. Wang, C.W. Gehrke, and M. Ehrlich. Comparison of bisulfite modification of 5-methyldeoxycytidine and deoxycytidine residues. Nucleic Acids Research, 8(20):4777–4790, 1980.

[60] M. Weber, I. Hellmann, M.B. Stadler, L. Ramos, S. Pääbo, M. Rebhan, and D. Schübeler. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nature Genetics, 39(4):457–466, 2007.

[61] H. Wu, B. Caffo, H.A. Jaffee, R.A. Irizarry, and A.P. Feinberg. Redefining CpG islands using Hidden Markov Models. Biostatistics, 11(3):499–514, 2010.

[62] A. Zemach, I.E. McDaniel, P. Silva, and D. Zilberman. Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science, 328(5980):916–919, 2010.

[63] X. Zhang, J. Yazaki, A. Sundaresan, S. Cokus, S.W.L. Chan, H. Chen, I.R. Henderson, P. Shinn, M. Pellegrini, S.E. Jacobsen, and J.R. Ecker. Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell, 126(6):1189–1201, 2006.

[64] Y. Zhang, C. Rohde, S. Tierling, T.P. Jurkowski, C. Bock, D. Santacruz, S. Ragozin, R. Reinhardt, M. Groth, J. Walter, and A. Jeltsch. DNA methylation analysis of chromosome 21 gene promoters at single base pair and single allele resolution. PLOS Genetics, 5(3):e1000438, 2009.

Notes:

(1) The number of occurrences of fragment fi in the solution follows a binomial distribution $B(wq,μi)$, for which $N(wqμi,wqμi(1−μi))$ is a good approximation. Therefore, the proportion of samples from which fi was generated can be approximated by the distribution $Nμi,μi(1−μi)wq$.

(2) In the moralization step, edges are added between all parents of each node, forming cliques, and all edges of the graph are changed to be non-directed. For further information on moralization and its importance for exact inference procedures, see [29].