## Christine Sinoquet and Raphaël Mourad

Print publication date: 2014

Print ISBN-13: 9780198709022

Published to Oxford Scholarship Online: December 2014

DOI: 10.1093/acprof:oso/9780198709022.001.0001


# Detection of Copy Number Variations from Array Comparative Genomic Hybridization Data Using Linear-chain Conditional Random Field Models

Chapter 16: Detection of Copy Number Variations from Array Comparative Genomic Hybridization Data Using Linear-chain Conditional Random Field Models
Source: Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics
Publisher: Oxford University Press
DOI: 10.1093/acprof:oso/9780198709022.003.0016

# Abstract and Keywords

Copy number variation (CNV) accounts for roughly 12% of the human genome. Besides their inherent role in cancer development, CNVs have been reported to underlie susceptibility to complex diseases. Each variation may range in size from around 1,000 nucleotides to just under 5 megabases. Array comparative genomic hybridization (aCGH) allows the identification of copy number alterations across genomes. The key computational challenge in analyzing CNVs using aCGH data is the detection of segment boundaries of copy number changes and the inference of the copy number state for each segment. Markov random fields and, more specifically, conditional random fields provide a unified framework for data preprocessing, segmentation, and copy number state decoding.

Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics. First Edition. Christine Sinoquet & Raphaël Mourad (Eds).© Oxford University Press 2014. Published in 2014 by Oxford University Press.

To gain more understanding of the role of inheritable copy number polymorphisms (CNPs) in determining disease phenotypes, systematic mapping and cataloging of CNPs are being carried out. This chapter proposes a novel undirected graphical model based on conditional random fields for CNV detection in aCGH data. This model effectively combines data preprocessing, segmentation, and copy number state decoding into one unified framework. The approach depicted, termed CRF-CNV, provides great flexibility in defining meaningful feature functions; it can therefore effectively integrate local spatial information of arbitrary sizes into the model. For model parameter estimation, the conjugate gradient (CG) method has been adopted to optimize the likelihood, and efficient forward/backward algorithms have been developed within the CG framework. The final step of copy number decoding is based on the Viterbi algorithm. The method was evaluated using real data as well as data simulated under realistic assumptions. Finally, the method is compared with two popular, publicly available programs.

# 16.1 Introduction

Structural variations in DNA sequences, such as inheritable copy number alterations, have been reported to be associated with numerous diseases. It has also been observed that somatic chromosomal aberrations (i.e., amplifications and deletions) in tumor samples are associated with different clinical or pathological features in different cancer types or subtypes [3, 6, 14]. To gain more understanding of the role of inheritable copy number polymorphisms (CNPs) in determining disease phenotypes, systematic mapping and cataloging of CNPs are needed and are being carried out. Identification of somatic copy number aberrations in cancer samples may lead to the discovery of important oncogenes or tumor suppressor genes. Given the remarkable capacity of existing technologies for assessing copy number variants (CNVs), there has recently been a great wave of interest in the research community in investigating inheritable as well as somatic CNVs [1, 3, 4, 6, 14, 16, 23].

Broadly speaking, there are essentially three technological platforms for CNV detection: array-based technology (including array comparative genomic hybridization (aCGH) [13, 20], as well as many other variants), single nucleotide polymorphism (SNP) genotyping technology [1, 14], and next-generation sequencing technology [2]. Array-based technology measures DNA copy number alterations from a disease/test sample relative to a normal/reference sample. In principle, for each gene (or genomic segment), each individual inherits one copy from each parent, so the total number of copies is two. However, when there are copy number mutations in an individual, the total number of copies might be one (i.e., a deletion), or three or more (i.e., amplifications/insertions). For each DNA fragment or “clone,” aCGH can indirectly measure the number of copies through fluorescence intensities, which represent the abundance of DNA hybridized to the segment. The intensity ratio of the test and the reference sample, measured in log scale, is expected to be proportional to the copy number change of the test sample relative to the reference sample (assumed to have two copies), though significant noise can be introduced from various sources during the process. Array-based technologies primarily target large segments of amplification and deletion, though different experimental platforms and designs using clones of different sizes may give very different resolutions and genome coverage [14]. The other two platforms, SNP genotyping and next-generation sequencing, use different techniques to measure copy number changes, and have gained popularity in recent years [2, 17]. We focus primarily on aCGH data analysis in this chapter.
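Concretely, under the idealized diploid-reference model just described, the expected log2 ratio for a clone with copy number c is log2(c/2). A minimal sketch (the function name is ours, not from the chapter):

```python
import math

def expected_log2_ratio(copy_number, reference_copies=2):
    """Expected log2 intensity ratio for a test sample carrying
    `copy_number` copies, relative to a diploid reference (a textbook
    idealization that ignores measurement noise)."""
    if copy_number <= 0:
        raise ValueError("copy number must be positive")
    return math.log2(copy_number / reference_copies)

# Single-copy loss, neutral, single-copy gain:
# expected_log2_ratio(1) -> -1.0
# expected_log2_ratio(2) ->  0.0
# expected_log2_ratio(3) ->  0.585 (log2 of 3/2, rounded)
```

These three values are exactly the dotted reference lines discussed for Fig. 16.1 below.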

Different platforms present different challenges, and one should naturally use different approaches for these platforms by taking advantage of the special properties of each data set. Not surprisingly, various algorithms have been proposed for different data in recent years. On the other hand, the primary goal of all such studies is to identify contiguous sets of segments that share the same copy number. Therefore, the essential computational task for processing data from different platforms is the same: segment the genome into discrete regions of constant copy number. Once this segmentation step is done, each segment is assigned a copy number; the latter task is called “classification”. One important commonality in performing segmentation for data from different platforms is the spatial correlation among clones. Many existing approaches have taken advantage of this property by utilizing the same methodology, Hidden Markov Models (HMMs), which can conveniently model spatial dependence using a chain structure. Results have shown initial success for HMMs [1, 4, 16, 23]. However, there is an inherent limitation in all these HMMs: they are all first-order HMMs and cannot take long-range dependence into consideration.

It has been shown that, in a variety of domains, conditional random fields (CRFs) consistently outperform HMMs, mainly because CRFs can potentially integrate all information from the data [21]. This property makes CRFs particularly appealing for modeling CNV data, since one can define feature functions using data from a region rather than the single or two data points used for emissions and transitions in HMMs. We have proposed a novel undirected graphical model based on CRFs for CNV detection in aCGH data [25]. Our model effectively combines data preprocessing, segmentation, and copy number state decoding into one unified framework. Our approach (termed CRF-CNV) provides great flexibility in defining meaningful feature functions; therefore it can effectively integrate local spatial information of arbitrary sizes into the model. For model parameter estimation, we have adopted the conjugate gradient (CG) method for likelihood optimization and developed efficient forward/backward algorithms within the CG framework. The final step of copy number decoding is based on the Viterbi algorithm. The method is evaluated using real data with known copy numbers as well as simulated data with realistic assumptions, and compared with two popular, publicly available programs. Experimental results have demonstrated that CRF-CNV outperforms a Bayesian Hidden Markov Model-based approach on both data sets in terms of copy number assignments. When compared to a non-parametric approach, CRF-CNV has achieved much greater precision while maintaining the same level of recall on the real data, and the performances of the two methods are comparable when applied to simulated data.

The remainder of this chapter is organized as follows. In Section 16.2, we give a brief overview of aCGH data and of existing approaches for detecting CNVs from aCGH data. Details of our CRF model for CNVs and its implementation are provided in Section 16.3. Our experimental results on two data sets, together with comparisons with other programs, are presented in Section 16.4. We conclude the chapter with a discussion in Section 16.5.

# 16.2 aCGH Data and Analysis

## 16.2.1 aCGH Data

Though, theoretically, our approach can be applied to data from different experimental platforms, we focus primarily on aCGH data in this analysis. Mathematically, aCGH data usually consist of an array of log2 intensity ratios for a set of clones, together with the physical position of each clone along the genome. Here, each clone is one data point in the array, and all the clones are ordered according to their physical positions on the genome. Fig. 16.1 (page 412) plots the normalized log2 ratio of a human cancer cell line (i.e., sample) relative to normal reference DNA, analyzed by Snijders et al. [18]. Each data point represents one genome segment/clone, and the y-axis represents the normalized log2 intensity ratio. The primary goal in CNV detection based on aCGH is to segment a genome into discrete regions that share the same mean log2 ratio pattern (i.e., have the same copy number). In a normal human sample, the copy number is usually two for all the autosomes. Ideally, the log2 ratio of a clone should be 0 if the test sample (e.g., a cancer cell line) also has two copies of DNA, and the value should be 0.585 (or −1) if it has a single-copy gain (or single-copy loss), etc. The dotted lines in Fig. 16.1, from bottom to top, show the three values for one, two, and three copies, respectively. A straightforward approach is to use global thresholds to assign the number of copies to each clone based on its log2 ratio. However, as shown in Fig. 16.1, aCGH data can be quite noisy, with vague boundaries between different segments; they may also have a complex local spatial dependence structure. These properties make the segmentation problem intrinsically hard, and approaches using global thresholds generally do not work well in practice.

Fig. 16.1 An aCGH profile of a Coriell cell line (GM05296). The borders between chromosomes are indicated by gray vertical lines. Dotted lines indicate the expected log2 ratio $\log_2(T/R)$ (where T represents the signal intensity for the test sample and R the signal intensity for the reference sample) for single-copy loss, same number of copies, and single-copy gain of this cell line relative to the reference DNA.

## 16.2.2 Existing Algorithms

In general, a number of steps are needed to detect copy number changes from aCGH data. First, the raw log2 ratio data usually need some preprocessing, including normalization and smoothing. Normalization is a necessary step to alleviate systematic errors due to experimental factors. Usually the input data are normalized by taking advantage of control data sets in which there are no differences between test and reference DNA; the purpose of normalization is to make the median or mean log2 ratio in the control data sets equal to 0. Smoothing is used to reduce noise due to random errors or abrupt changes. Smoothing methods generally filter the data using a sliding window, attempting to fit a curve to the data while handling abrupt changes and reducing random errors. In addition to methods based on sliding windows, several other techniques have been proposed, including quantile smoothing, wavelet smoothing, etc. (see [5, 8]).
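As an illustration of window-based smoothing, a minimal sliding-window median filter might look as follows (a generic sketch, not the chapter's implementation; the window is simply truncated at the chromosome ends):

```python
def median_smooth(x, half_width=2):
    """Sliding-window median filter over a list of log2 ratios.
    The window [i - half_width, i + half_width] is truncated at the
    boundaries of the sequence."""
    out = []
    n = len(x)
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        w = sorted(x[lo:hi])
        m = len(w)
        # median of the (possibly truncated) window
        out.append(w[m // 2] if m % 2 else 0.5 * (w[m // 2 - 1] + w[m // 2]))
    return out

# A lone spike is flattened while the surrounding level is preserved:
# median_smooth([0, 0, 5, 0, 0], 1) == [0, 0, 0, 0, 0]
```

Median filtering of this kind reappears later in the chapter, where window medians are the building block of the CRF feature functions.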

The second step in analyzing aCGH data is referred to as segmentation, and it aims to identify segments that share the same mean log2 ratio. Broadly speaking, there are two related estimation problems. One is to infer the number and statistical significance of the alterations; the other is to locate their boundaries accurately. A few different algorithms have been proposed to solve these two estimation problems. Olshen et al. [12] have proposed a non-parametric approach based on the recursive circular binary segmentation (CBS) algorithm. Hupe et al. [9] have proposed an approach called GLAD, which is based on a median absolute deviation model, to separate outliers (observations that are numerically distant from the rest of the data) from their surrounding segments. Willenbrock and Fridlyand [24] have compared the performance of CBS (implemented in DNACopy) and GLAD using a realistic simulation model, and concluded that CBS is in general better than GLAD. We have adopted the simulation model from [24] in our experimental study. After obtaining the segmentation outcomes, a post-processing step is needed to combine segments with similar mean levels and to classify them as single-copy gain, single-copy loss, normal, multiple gains, etc. Methods such as GLADMerge [9] and MergeLevels [24] can post-process the segmentation results and label them accordingly.

As noted by Willenbrock and Fridlyand [24], it is more desirable to perform segmentation and classification simultaneously. An easy way to merge these two steps is to use a linear chain HMM. The underlying hidden states are the real copy numbers. Given a state, the log2 ratio can be modeled using a Gaussian distribution. The transition from one state to another reveals the likelihood of copy number changes between adjacent clones. Given observed data, standard algorithms (forward/backward, Baum–Welch, and Viterbi) can be used to estimate parameters and to decode hidden states. A few variants of HMMs have been proposed for aCGH data in recent years [7, 16]. Lai et al. [11] have shown that HMMs performed the best for small aberrations given a sufficient signal/noise ratio. Guha et al. [7] have proposed a Bayesian HMM which can impose biologically meaningful priors on the parameters. Shah et al. [16] have extended this Bayesian HMM by incorporating outliers and location-specific priors, which can be used to model inheritable copy number polymorphisms.
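To make the decoding step concrete, here is a generic log-space Viterbi sketch for a small chain; all parameter values in the usage example are illustrative toy numbers, not estimates from any aCGH data set:

```python
import math

def viterbi(obs, states, log_pi, log_trans, log_emit):
    """Most likely hidden state path for a linear chain.
    log_pi[s]: log initial probability; log_trans[r][s]: log transition
    probability; log_emit(s, x): log emission density of x in state s.
    Working in log space avoids numerical underflow."""
    delta = {s: log_pi[s] + log_emit(s, obs[0]) for s in states}
    back = []
    for x in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            delta[s] = prev[best] + log_trans[best][s] + log_emit(s, x)
            ptr[s] = best
        back.append(ptr)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy 3-state decoding around the expected log2 ratios -1, 0, and 0.585:
means = {"loss": -1.0, "neutral": 0.0, "gain": 0.585}
states = list(means)
log_pi = {s: math.log(1.0 / 3) for s in states}
log_trans = {r: {s: math.log(0.9 if r == s else 0.05) for s in states}
             for r in states}
log_emit = lambda s, x: -0.5 * ((x - means[s]) / 0.2) ** 2
path = viterbi([0.0, 0.05, -0.95, -1.1, 0.6, 0.55],
               states, log_pi, log_trans, log_emit)
# path: ['neutral', 'neutral', 'loss', 'loss', 'gain', 'gain']
```

The same Viterbi recursion is reused for the final decoding step of the CRF model described in Section 16.3.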

Notice that all these models are first-order HMMs which cannot capture long-range dependence. Intuitively, it makes sense to consider high-order HMMs to capture informative local correlation, which is an important property observed from aCGH data. However, considering higher orders will make HMMs more complex and computationally intensive.

# 16.3 Linear-chain CRF Model for aCGH Data

To overcome the limitations of HMMs, we propose a new model based on the theory of linear-chain conditional random fields (CRFs) [10, 21]. CRFs are undirected graphical models designed for calculating the conditional distribution of output random variables Y given input variables X. The term “random field,” which is equivalent to “Markov network” in graph theory, has been widely used in statistical physics and computer vision. CRFs are also known as conditional Markov networks [22]. Because CRFs are conditional models, dependences among the variables X do not need to be explicitly specified. Therefore, one can define meaningful feature functions that effectively capture local spatial dependence by using the input variables X. CRFs have been widely applied to language processing, computer vision, and bioinformatics, with remarkable performance when compared with directed graphical models including HMMs.

Fig. 16.2 A linear-chain conditional random field model, where the Yi’s are output variables and the Xi’s are input variables. They are connected through gray square nodes (which are not variables, but factors in factor graphs), indicating that feature functions will be defined on the variables connected by square nodes. The dotted lines indicate that, for two adjacent output variables Yi and Yi+1, an arbitrary set of consecutive input variables containing the two corresponding input variables ({…, Xi, Xi+1, …}) can be linked to them.

In general, a linear-chain CRF (see Fig. 16.2, page 414) is defined as the conditional distribution

$$P_\theta(Y\mid X)=\frac{1}{Z_\theta(X)}\exp\Bigg(\sum_{i=1}^{n-1}\sum_{j=1}^{S(i)}\theta_{ij}\,f_{ij}(Y_i,Y_{i+1},\tilde{X}_i)\Bigg),$$

where the partition function

$$Z_\theta(X)=\sum_{Y}\exp\Bigg(\sum_{i=1}^{n-1}\sum_{j=1}^{S(i)}\theta_{ij}\,f_{ij}(Y_i,Y_{i+1},\tilde{X}_i)\Bigg).$$

Here $\theta=\{\theta_{ij}\}$ are the parameters and the $\{f_{ij}\}$ are feature functions. $\tilde{X}_i$ is a neighbor set of $X_i$ that is needed for computing the features relating clones $i$ and $i+1$. $S(i)$ is the total number of feature functions related to $Y_i$ and $Y_{i+1}$. The variables connected by each square node in Fig. 16.2 are used to define potential feature functions. Each feature function operates on a pair of adjacent label variables $Y_i$ and $Y_{i+1}$ and a neighbor set of $X_i$.

We define the linear-chain CRF model for aCGH data for one sample (actually one chromosome in one sample, because different chromosomes can be treated independently) as follows. The general model for a set of observations can easily be obtained by assuming independence between observations. Let $X=(X_1,\dots,X_n)$ denote the normalized log2 ratio intensities along one chromosome for a sample/cell line, where $X_i$ is the log2 ratio for clone $i$. One can assume that these $n$ clones are sequentially positioned on a chromosome. Let $Y=(Y_1,\dots,Y_n)$ denote the corresponding hidden copy number states, where $Y_i$ belongs to $\{1,\dots,s\}$ and $s$ is the total number of copy number states. These states usually indicate deletion, single-copy loss, neutral, single-copy gain, two-copy gain, or multiple-copy gain; the exact number of states and their meaning need to be specified based on the specific input data. The conditional probability of $Y$ given the observed log2 ratio $X$ based on our linear-chain CRF structure is defined as

(16.1)
$$P_\theta(Y\mid X)=\frac{1}{Z_\theta(X)}\exp\Bigg(\sum_{i=1}^{n}\sum_{j=1}^{s}\big[\lambda_j f_j(Y_i,\tilde{X}_i)+\mu_j g_j(Y_i,\tilde{X}_i)\big]+\sum_{j=1}^{s}\omega_j l_j(Y_1,\tilde{X}_1)+\sum_{i=1}^{n-1}\sum_{j=1}^{s}\sum_{k=1}^{s}\nu_{jk} h_{jk}(Y_i,Y_{i+1},\tilde{X}_{i,i+1})\Bigg)$$

where the partition function is:

$$Z_\theta(X)=\sum_{Y}\exp\Bigg(\sum_{i=1}^{n}\sum_{j=1}^{s}\big[\lambda_j f_j(Y_i,\tilde{X}_i)+\mu_j g_j(Y_i,\tilde{X}_i)\big]+\sum_{j=1}^{s}\omega_j l_j(Y_1,\tilde{X}_1)+\sum_{i=1}^{n-1}\sum_{j=1}^{s}\sum_{k=1}^{s}\nu_{jk} h_{jk}(Y_i,Y_{i+1},\tilde{X}_{i,i+1})\Bigg).$$

Here $\theta=\{\lambda_j,\mu_j,\omega_j,\nu_{jk}\}$ are parameters that need to be estimated. The functions $f_j$, $g_j$, $l_j$, and $h_{jk}$ are feature functions that need to be defined. $\tilde{X}_i(u)$ is defined as a neighbor set of $X_i$ around clone $i$, i.e., $\tilde{X}_i(u)=\{X_{i-u},\dots,X_{i-1},X_i,X_{i+1},\dots,X_{i+u}\}$, where $u$ is a hyperparameter meant to define the dependence length. In principle, $u$ can be defined based on physical distances between clones instead of the number of clones, which may better capture local dependence when clones are unevenly distributed on a chromosome. For simplicity, we use the number of clones here. For $i\le u$ or $i\ge n-u$, $\tilde{X}_i(u)$ is redefined by proper truncation; for example, $\tilde{X}_1(3)=\{X_1,\dots,X_4\}$. Similarly, we define $\tilde{X}_{i,i+1}(u)=\{X_{i-u},\dots,X_{i-1},X_i,X_{i+1},X_{i+2},\dots,X_{i+1+u}\}$, $\tilde{X}_{i,i+1}^{-}(u)=\{X_{i-u},\dots,X_{i-1},X_i\}$, and $\tilde{X}_{i,i+1}^{+}(u)=\{X_{i+1},X_{i+2},\dots,X_{i+1+u}\}$. The two latter sets will be used further below.

As illustrated in Fig. 16.2, although we also use a chain structure in our CRF model dedicated to CNV detection, the feature functions to be defined rely on observed data from a region. We define our feature functions by integrating all the information from these neighborhood sets; therefore, these feature functions can capture abundant local spatial dependence. The dependence length $u$ plays a role similar to the width of the sliding window in smoothing methods. In addition, by using a CRF, we can effectively combine smoothing, segmentation, and classification into one unified framework. For notational simplicity, we drop the parameter $u$ in our subsequent discussions and write $\tilde{X}_i(u)$ as $\tilde{X}_i$, $\tilde{X}_{i,i+1}(u)$ as $\tilde{X}_{i,i+1}$, and so on.

## 16.3.1 Feature Functions

One important step in building our model is to define meaningful feature functions that can capture critical information from the input data. Essentially, we define two types of feature functions, analogous to the emission and transition probabilities in HMMs. However, our feature functions can be of any form; therefore, our model provides much more flexibility and is able to capture long-range dependence. The emission feature functions $f_j(Y_i,\tilde{X}_i)$ and $g_j(Y_i,\tilde{X}_i)$ are defined as follows:

$Display mathematics$
$Display mathematics$

where $\mathrm{med}\,\tilde{X}_i$ is defined as the median value of the set $\tilde{X}_i$. Our emission features serve two purposes. First, they act as median filters that automatically smooth the input data. More importantly, the feature functions based on the first-order and second-order median statistics are robust versions of the sufficient statistics one can derive from a normal distribution.

The transition feature function $h_{jk}(Y_i,Y_{i+1},\tilde{X}_{i,i+1})$ and the initial feature function $l_j(Y_1,\tilde{X}_1)$ are defined as follows:

$h_{jk}(Y_i,Y_{i+1},\tilde{X}_{i,i+1})=$

$Display mathematics$
$Display mathematics$

Here $a_j$ denotes the mean log2 ratio for clones with copy number state $j$ ($j=1,\dots,s$), while $a_0$ and $a_{s+1}$ denote the greatest lower bound of the log2 ratio for clones with copy number state 1 and the least upper bound of the log2 ratio for clones with copy number state $s$, respectively. Without loss of generality, we assume $a_0<a_1<\cdots<a_{s+1}$. We define the initial feature function $l_j(Y_1,\tilde{X}_1)$ in such a way that data from the clone set $\tilde{X}_1$ only provide information to its own state. Furthermore, when $Y_1$ is equal to $j$, the closer the value of $\mathrm{med}\,\tilde{X}_1$ to $a_j$, the higher the value of $l_j(Y_1,\tilde{X}_1)$ and the more information the data provide, resulting in a higher contribution to the parameter $\omega_j$. The function $l_j(Y_1,\tilde{X}_1)$ achieves its highest value of 1 when $\mathrm{med}\,\tilde{X}_1$ is equal to $a_j$. The other terms in $l_j(Y_1,\tilde{X}_1)$ (i.e., $(a_{j+1}-a_j)/2$ and $(a_j-a_{j-1})/2$) are scale factors, which make $l_j(Y_1,\tilde{X}_1)$ equal to $1/2$ when the value of $\mathrm{med}\,\tilde{X}_1$ is halfway between $a_{j-1}$ and $a_j$ or halfway between $a_j$ and $a_{j+1}$ (i.e., $\mathrm{med}\,\tilde{X}_1=(a_{j-1}+a_j)/2$ or $\mathrm{med}\,\tilde{X}_1=(a_j+a_{j+1})/2$). The transition feature function $h_{jk}(Y_i,Y_{i+1},\tilde{X}_{i,i+1})$ is similarly defined using the clone set $\tilde{X}_{i,i+1}$. When $Y_i$ equals $j$ and $Y_{i+1}$ equals $k$, the closer the value of $\mathrm{med}\,\tilde{X}_{i,i+1}^{-}$ to $a_j$ and the value of $\mathrm{med}\,\tilde{X}_{i,i+1}^{+}$ to $a_k$, the higher the value of $h_{jk}(Y_i,Y_{i+1},\tilde{X}_{i,i+1})$ and the more information the data contribute to $\nu_{jk}$. Those scale factors serve the same purpose as those in the emission feature functions. Clearly, both types of feature functions can capture the local spatial dependence over a set of adjacent clones and thus potentially provide more robust inference about the hidden copy number states.
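The chapter's displayed definitions are not reproduced here, but the behavior just described (a peak of 1 at $a_j$, falling to 1/2 halfway toward the neighboring level means) is consistent with a simple tent-shaped function. The following is a hypothetical reconstruction for illustration only, not the chapter's actual definition:

```python
def initial_feature(med, j, a):
    """HYPOTHETICAL tent-shaped reconstruction of l_j.  `a` holds the
    bounds and level means a_0 < a_1 < ... < a_{s+1}; the value peaks
    at 1 when the window median `med` equals a[j], falls to 1/2 halfway
    toward a neighboring level mean, and reaches 0 at that neighbor."""
    span = a[j + 1] - a[j] if med >= a[j] else a[j] - a[j - 1]
    return max(0.0, 1.0 - abs(med - a[j]) / span)
```

For example, with levels $a = (-1.35, -0.68, 0.0, 0.5, 0.81)$ (toy values), a window median of 0 gives the neutral state ($j=2$) a feature value of 1, and a median of 0.25 (halfway to the gain level) gives 1/2, matching the scale-factor behavior described above.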

## 16.3.2 Parameter Estimation

Unlike the case of the standard algorithms for HMM training, there are significant computational challenges in efficiently and accurately estimating parameters for CRFs. For example, the partition function $Z_\theta(X)$ in model 16.1 needs to be explicitly calculated in order to estimate parameters, whereas no such partition function has to be considered for HMMs. The implementation of the training algorithms for our proposed linear-chain CRF model requires sophisticated statistical and numerical algorithms; to the best of our knowledge, no existing implementations can be used to solve this problem. We propose the following algorithm for parameter estimation.

In general, given a training set $\mathcal{D}=\{(X^{(d)},Y^{(d)}),\,d=1,\dots,D\}$ consisting of $D$ observations (i.e., individuals), to estimate the parameters $\theta=\{\lambda_j,\mu_j,\omega_j,\nu_{jk}\}$ in model 16.1, one needs to maximize a penalized conditional log-likelihood, which is defined as follows:

(16.2)
$Display mathematics$

Here $\|\theta\|$ is the L2-norm of $\theta$. The penalization term $\|\theta\|^2/2\sigma^2$ is added for regularization, to prevent overfitting. Before one can solve the optimization problem, one first has to specify an additional set of hyperparameters: the dependence length $u$, the mean log2 ratios $\{a_j, j=0,\dots,s+1\}$, and the penalization coefficient $\sigma^2$. For each $j$ belonging to $\{1,\dots,s\}$, $a_j$ can be directly estimated from the training set $\mathcal{D}$: the maximum likelihood estimate of $a_j$ is simply the mean value of the log2 ratios over all clones with copy number state $j$ in $\mathcal{D}$. In addition, $a_0$ and $a_{s+1}$ can be imputed using the minimum log2 ratio over all clones with copy number state 1 and the maximum value over all clones with copy number state $s$, respectively. For the dependence length $u$ and the penalization coefficient $\sigma^2$, we rely on a grid-search approach through cross-validation, which performs an exhaustive search by partitioning the search space into small grids. More specifically, the original training set $\mathcal{D}$ is first partitioned into two sets $\mathcal{D}_1$ and $\mathcal{D}_2$. We call $\mathcal{D}_1$ the new training set and $\mathcal{D}_2$ the validation set. For a given range of (discrete) parameter values of $u$ and $\sigma^2$, we train the model on $\mathcal{D}_1$ and obtain estimates of $\theta$ for each fixed pair $(u_0,\sigma_0^2)$; the exact procedure to estimate $\theta$ given $(u_0,\sigma_0^2)$ is discussed shortly. We then apply the trained model with the estimated parameters to the validation set $\mathcal{D}_2$ and record the prediction error under the current model. The model with the smallest prediction error, together with the associated parameters $(u,\sigma^2,\theta)$, is kept. The prediction error is defined as the mean absolute error over all samples in the validation set $\mathcal{D}_2$, where the absolute error for a clone $i$ is defined as $|Y_i-\hat{Y}_i|$, with $Y_i$ the known copy number and $\hat{Y}_i$ the predicted copy number. This measure not only captures whether a prediction is exactly the same as the real copy number but also reflects how close the two numbers are.
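The model-selection loop just described can be sketched generically as follows; the `fit` and `predict` callables are stand-ins for the CG training and Viterbi decoding routines, and all names and signatures are ours:

```python
import itertools

def grid_search(train, validate, u_grid, sigma2_grid, fit, predict):
    """Exhaustive (u, sigma^2) search with a held-out validation split.

    fit(train, u, sigma2)  -> trained parameters theta (stand-in)
    predict(theta, x, u)   -> predicted copy numbers for one sample x
    validate               -> list of (x, true_states) pairs
    Returns (mae, u, sigma2, theta) for the best configuration,
    scored by mean absolute copy-number error."""
    best = None
    for u, sigma2 in itertools.product(u_grid, sigma2_grid):
        theta = fit(train, u, sigma2)
        errors = [abs(y - y_hat)
                  for x, y_true in validate
                  for y, y_hat in zip(y_true, predict(theta, x, u))]
        mae = sum(errors) / len(errors)
        if best is None or mae < best[0]:
            best = (mae, u, sigma2, theta)
    return best
```

In the chapter's experiments (Section 16.4), the $\sigma^2$ grid is geometric ($400\times 0.8^m$) and $u$ ranges over a small integer grid, so the loop above stays cheap.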

For a given set of hyperparameters $\{a_j\}$, $u$, and $\sigma^2$, the optimization of $L_\theta$ in equation 16.2 can be solved using gradient-based numerical optimization methods [19]. We choose the non-linear conjugate gradient (CG) method in our implementation, which generalizes the CG method to non-linear optimization. To find a local optimum of a non-linear function, the non-linear CG method only needs to calculate the gradient, i.e., the first derivatives of $L_\theta$ in our problem. The outline of our non-linear CG method is as follows. Given the objective function $L_\theta$, we search in the gradient direction

$Display mathematics$

where $\partial L_\theta/\partial\theta$ is the first-derivative function of $L_\theta$ and $\theta^{(0)}$ is the initial value of $\theta$. Then we perform a line search in this direction until $L_\theta$ reaches its maximum, and update the parameter $\theta$ accordingly:

$Display mathematics$

After this first iteration in the gradient direction $d^{(0)}$, the following steps constitute one iteration of moving along a subsequent conjugate direction $d^{(i)}$ and updating $\theta$ accordingly:

1. Calculate the gradient direction $\xi^{(i)}=\frac{\partial L_\theta}{\partial\theta}\Big|_{\theta=\theta^{(i)}}$;

2. compute $\beta^{(i)}=\frac{(\xi^{(i)})'\xi^{(i)}}{(\xi^{(i-1)})'\xi^{(i-1)}}$;

3. update the conjugate direction $d^{(i)}=\xi^{(i)}+\beta^{(i)}d^{(i-1)}$;

4. perform a line search: $\alpha^{(i)}=\arg\max_\alpha L(\theta^{(i)}+\alpha d^{(i)})$;

5. update $\theta^{(i+1)}=\theta^{(i)}+\alpha^{(i)}d^{(i)}$.
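Steps 1–5 above can be sketched for a generic differentiable objective as follows (a hedged illustration: the exact line search of step 4 is approximated by a coarse grid of step sizes, and all names are ours, not the chapter's):

```python
def conjugate_gradient_max(f, grad, theta0, iters=50, tol=1e-10):
    """Non-linear conjugate gradient (Fletcher-Reeves style beta) for
    MAXIMIZING f; `grad` returns the gradient of f as a list."""
    def line_search(theta, d):
        # crude stand-in for the exact line search of step 4:
        # pick the best step size on a geometric grid
        best_a, best_f, a = 0.0, f(theta), 1.0
        for _ in range(40):
            fa = f([t + a * di for t, di in zip(theta, d)])
            if fa > best_f:
                best_a, best_f = a, fa
            a *= 0.7
        return best_a

    theta = list(theta0)
    xi = grad(theta)                  # first direction: the gradient itself
    d = xi[:]
    for _ in range(iters):
        alpha = line_search(theta, d)                        # step 4
        theta = [t + alpha * di for t, di in zip(theta, d)]  # step 5
        xi_new = grad(theta)                                 # step 1
        if sum(g * g for g in xi_new) < tol:
            break
        beta = sum(g * g for g in xi_new) / sum(g * g for g in xi)  # step 2
        d = [g + beta * di for g, di in zip(xi_new, d)]      # step 3
        xi = xi_new
    return theta
```

Running this on a simple concave quadratic (e.g., $-(\theta_1-3)^2-2(\theta_2+1)^2$) converges to its maximizer within a few iterations.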

The CG procedure runs until convergence, and the final $\theta$ is regarded as the optimal parameter. The first-order derivatives of $L_\theta$ with respect to $\{\lambda_j\}$, $\{\mu_j\}$, $\{\omega_j\}$, and $\{\nu_{jk}\}$ are given by

$Display mathematics$

$Display mathematics$

The marginal distributions $P(Y_i=y\mid X^{(d)})$ $(i=1,\dots,n)$ and $P(Y_i=y,Y_{i+1}=y^*\mid X^{(d)})$ $(i=1,\dots,n-1)$ in the above derivatives can be efficiently computed using the forward–backward algorithms. First, define the following forward variables $\gamma_i$ and backward variables $\eta_i$ for $i=1,\dots,n$:

$Display mathematics$

where the functions F, G, L and H are defined based on the feature functions:

$Display mathematics$

The marginal distributions can be obtained by combining forward and backward variables:

(16.3)
$Display mathematics$
(16.4)
$Display mathematics$

and $Z_\theta(X^{(d)})$ can also be efficiently computed using the forward variables: $Z_\theta(X^{(d)})=\sum_{y=1}^{s}\gamma_n(y,d)$.
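Since the functions F, G, L, and H are chapter-specific, the recursions can be illustrated with generic unary and pairwise potentials. A minimal sketch with our own variable names (a real implementation would work in log space, for the stability reasons discussed next):

```python
def forward_backward(psi1, psi, n, s):
    """Forward/backward recursions for a linear chain with n positions
    and s states.

    psi1[i][y]  : unary potential of state y at position i
    psi[i][y][z]: pairwise potential linking positions i and i+1
    (these play the role of the exponentiated feature terms).
    Returns the forward table gamma, backward table eta, and Z."""
    gamma = [[0.0] * s for _ in range(n)]
    eta = [[0.0] * s for _ in range(n)]
    gamma[0] = list(psi1[0])
    for i in range(1, n):
        for y in range(s):
            gamma[i][y] = psi1[i][y] * sum(
                gamma[i - 1][y0] * psi[i - 1][y0][y] for y0 in range(s))
    eta[n - 1] = [1.0] * s
    for i in range(n - 2, -1, -1):
        for y in range(s):
            eta[i][y] = sum(psi[i][y][z] * psi1[i + 1][z] * eta[i + 1][z]
                            for z in range(s))
    Z = sum(gamma[n - 1])  # partition function from the forward pass
    return gamma, eta, Z

def node_marginal(gamma, eta, Z, i, y):
    # P(Y_i = y | X) = gamma_i(y) * eta_i(y) / Z   (cf. formula 16.3)
    return gamma[i][y] * eta[i][y] / Z
```

With all potentials equal to 1, Z simply counts the $s^n$ label sequences, a convenient sanity check.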

We noticed that the calculation of the forward variables $\gamma_i$ and backward variables $\eta_i$ is not stable: it may suffer from overflow due to the numerous products of exponentials. To achieve numerical stability, we apply a log transformation when calculating the recursions; similar techniques have been used for HMMs [15]. To calculate $\gamma_i(y,d)$ according to the formula defined earlier, first let $y_0=\arg\max_y \log\gamma_{i-1}(y,d)$. We then divide each component in the recursive formula of $\gamma_i(y,d)$, before exponentiation, by $\gamma_{i-1}(y_0,d)$. After taking the log transformation, the new recursive relationship becomes:

$Display mathematics$

The backward variables $\eta_i(y,d)$ can be transformed similarly:

$Display mathematics$

where $z_0=\arg\max_y \log\eta_{i+1}(y,d)$.

The marginal distributions in formulae 16.3 and 16.4 can be calculated using the stable $\log\gamma_i$ and $\log\eta_i$. For example, $P(Y_i=y\mid X^{(d)})=\exp\{\log\gamma_i(y,d)+\log\eta_i(y,d)-\log Z_\theta(X^{(d)})\}$ for $y=1,\dots,s$ and $i=1,\dots,n$, where $\log Z_\theta(X^{(d)})=\log\gamma_n(y_0,d)+\log\big(1+\sum_{y\ne y_0}\exp[\log\gamma_n(y,d)-\log\gamma_n(y_0,d)]\big)$ with $y_0=\arg\max_y\log\gamma_n(y,d)$.
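The max-shift used in these log-space recursions is the standard log-sum-exp trick; a self-contained helper (ours, for illustration):

```python
import math

def logsumexp(vals):
    """log(sum(exp(v) for v in vals)), computed stably by factoring
    out the largest term -- the same max-shift used in the
    log-transformed recursions above."""
    m = max(vals)
    if m == float("-inf"):      # all terms are zero
        return m
    return m + math.log(sum(math.exp(v - m) for v in vals))

# math.exp(1000) would overflow, yet:
# logsumexp([1000.0, 1000.0]) -> 1000.0 + log(2) ~= 1000.693
```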

It is important to discuss the computational cost of training. The partition function $Z_\theta(X)$ in the likelihood, and the marginal distributions $P(Y_i=y\mid X^{(d)})$ and $P(Y_i=y,Y_{i+1}=y^*\mid X^{(d)})$ in the gradient, can be efficiently computed using the standard forward–backward algorithms. The forward–backward algorithm required for these three computations has $O(ns^2)$ complexity, where $n$ is the number of clones and $s$ is the number of hidden states. However, each training observation has its own partition function and marginal distributions, so we need to run a forward–backward procedure for each observation in the training set and for each gradient computation. Therefore, the total cost of training is $O(ns^2dg)$, where $d$ is the number of observations in the training set and $g$ is the number of gradient computations required by the optimization procedure.

For graphical-model-based approaches such as HMMs, many researchers pool data across all chromosomes from all samples, which can dramatically reduce the number of parameters needed without sacrificing much on inference accuracy. We also take a similar approach. This is reflected by our homogeneous linear-chain CRF structure.

## 16.3.3 Evaluation Methods

We implemented the proposed approach as a Matlab package termed CRF-CNV and evaluated its performance using a publicly available real data set with known copy numbers [18] and a synthetic data set generated in [24]. Notice that many clones have the normal (two) copies of DNA; therefore, the number of correctly predicted state labels is not a good measure of an algorithm's performance. Instead, we compare the performance of CRF-CNV with two popular programs in terms of the number of predicted segments and the accuracy of segment boundaries, referred to as breakpoints. To summarize the performance of an algorithm over multiple chromosomes and individuals, we use a single value called the F-measure, which combines precision and recall. Recall that, given the true copy number state labels and the predicted labels, precision (P) is defined as $n_{tp}/n_p$ and recall (R) is defined as $n_{tp}/n_t$, where $n_{tp}$ is the number of true positive (i.e., correctly predicted) breakpoints, $n_p$ is the number of predicted breakpoints, and $n_t$ is the number of true breakpoints. The F-measure is defined as $F=2PR/(P+R)$, which aims to find a balance between precision and recall. The two programs we chose are CBS [12] and CNA-HMMer [16], both of which have been implemented as Matlab tools. As mentioned earlier, CBS is one of the most popular segmentation algorithms, and different groups have shown that it generally performs better than many other algorithms. CNA-HMMer was chosen because we wanted to compare the performance of our CRF model with HMMs, and CNA-HMMer is an implementation of a Bayesian HMM model with high accuracy [16].
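These definitions translate directly into code. A small sketch (the within-`tol` matching of predicted to true breakpoints is our illustrative assumption; the chapter does not spell out its exact matching rule):

```python
def f_measure(true_bps, pred_bps, tol=0):
    """Precision, recall, and F for predicted breakpoint positions.
    A predicted breakpoint counts as a true positive if it lies within
    `tol` clones of a not-yet-matched true breakpoint (`tol` is an
    illustrative assumption, not the chapter's stated rule)."""
    unmatched = set(true_bps)
    ntp = 0
    for p in sorted(pred_bps):
        hit = next((t for t in sorted(unmatched) if abs(t - p) <= tol), None)
        if hit is not None:
            unmatched.discard(hit)
            ntp += 1
    precision = ntp / len(pred_bps) if pred_bps else 0.0
    recall = ntp / len(true_bps) if true_bps else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

For instance, with true breakpoints {10, 50, 90} and predictions {10, 48, 200} at `tol=2`, two predictions match, so P = R = F = 2/3.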

# 16.4 Experimental Results

## 16.4.1 A Real Example

The Coriell data are regarded as a well-known "gold standard" data set; they were originally analyzed in [18], can be downloaded from http://www.nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html, and have been widely used in testing new algorithms and in comparing different algorithms. The CBS algorithm was applied to this data set in the original paper (p.422) [12]. The Coriell data consist of 15 cell lines, named GM03563, GM00143, …, GM01524; we simply use the numbers 1, 2, …, 15 to represent these cell lines. For this particular data set, there are only three states (s = 3): single-copy loss, normal copy, and single-copy gain. Notice that, unlike CBS, CRF-CNV requires training data to estimate parameters, so it is unfair to directly compare the prediction results of CRF-CNV on its own training data with results from CBS. We took a simple approach that divides the 15 samples into three groups of 5 samples each. In the first run, we used group 1 as training data and group 2 as validation data to obtain the model parameters (as discussed in Subsection 16.3.2). We then used the model to predict the data in group 3 (testing data) and recorded the prediction results. In the second and third runs, we alternated the roles of groups 1–3 and obtained prediction results for the samples in group 1 and group 2, respectively. Finally, we summarized our results over all 15 samples. For example, for the first run, we first obtained $\{a_j, j=0,\dots,4\}$ directly from the samples in group 1; the estimates were $-1.348$, $-0.682$, $-0.001$, $0.497$, and $0.810$. To estimate the penalization coefficient $\sigma^2$ and the dependence length $u$, we defined the search space as $A \times B = \{0,1,2,\dots,30\} \times \{0,1,\dots,5\}$. For each value $m$ in $A$ and each value $u_0$ in $B$, we let $\sigma^2 = 400 \times 0.8^m$ and $u = u_0$; the geometric decay allows $\sigma^2$ to be searched over a broad range. The upper bound on $u$ was set to 5 because, for aCGH data such as the Coriell data set, each clone can cover quite a long range of DNA.
The optimal $\sigma^2$ and $u$ were chosen by minimizing the prediction errors on the samples in group 2 (the validation set). Our results indicated that the model with $u = 1$ and $m = 21$ achieves the lowest prediction error; the values of $\theta$ corresponding to this best model were recorded. We then applied Viterbi's algorithm to find the most probable hidden copy number states for the samples in group 3, as well as the number and boundaries of segments. Runs 2 and 3 yielded results for group 1 and group 2, respectively. For CNA-HMMer, one can either use its default priors or use training data to obtain informative priors; we tested its performance both with and without informative priors.
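The hyperparameter grid just described can be sketched as follows. Here `validation_error` is a hypothetical placeholder standing in for training the CRF under the given penalty and scoring the validation samples; it is not part of the authors' code.

```python
def grid_search(validation_error):
    """Search sigma^2 over a geometric decay 400 * 0.8^m, m in A = {0..30},
    crossed with the dependence length u in B = {0..5}; keep the setting
    with the lowest validation error."""
    best = None
    for m in range(31):                 # search space A
        sigma2 = 400 * 0.8 ** m        # geometric decay covers a broad range
        for u in range(6):              # search space B
            err = validation_error(sigma2, u)
            if best is None or err < best[0]:
                best = (err, sigma2, u)
    return best                         # (error, sigma^2, u)
```

With a toy error surface minimized at m = 21 and u = 1, the search recovers exactly the setting reported above.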

Table 16.1 Comparison of the number of segments identified along the genome, by three algorithms, for each sample of the Coriell data set.

| Method/Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | Sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gold standard | 5 | 3 | 5 | 3 | 5 | 3 | 5 | 5 | 5 | 5 | 3 | 5 | 2 | 3 | 3 | 60 |
| CRF-CNV | 5 | 3 | 5 | 3 | 5 | 3 | 5 | 5 | 3 | 3 | 3 | 5 | 2 | 3 | 3 | 56 |
| CBS | 17 | 42 | 7 | 6 | 9 | 5 | 5 | 6 | 5 | 5 | 13 | 7 | 17 | 3 | 9 | 156 |
| CNA-HMMer (default) | 9 | 83 | 9 | 7 | 11 | 3 | 7 | 5 | 11 | 11 | 21 | 5 | 16 | 10 | 16 | 224 |
| CNA-HMMer (trained) | 3 | 3 | 5 | 3 | 5 | 3 | 5 | 5 | 3 | 3 | 3 | 5 | 2 | 3 | 3 | 54 |

Table 16.1 shows, for each sample, the number of segments identified along the genome by the gold standard and predicted by the three algorithms CRF-CNV, CBS, and CNA-HMMer. The number of segments detected by CRF-CNV matches the gold standard for all samples except samples 9 and 10. Further examination of samples 9 and 10 (see Fig. 16.3) reveals that the segment we missed in sample 9 contains only one clone, which was smoothed out by our algorithm; the segment (p.423) missed in sample 10 is also very short, and its signal is very weak. Our results clearly show that CBS generated many more segments compared to the ground truth, which is consistent with the results in the original paper [18]. The overall number of segments reported by CNA-HMMer with default priors was even greater than the total from CBS. On the other hand, once we used training data to assign informative priors for CNA-HMMer, it returned almost the same number of segments as CRF-CNV; the only exception was that CNA-HMMer missed one breakpoint in sample 1. This illustrates that, given correctly labeled training data, both CRF-CNV and CNA-HMMer can effectively eliminate all false positives in this data set. In the subsequent comparisons, we report only the results of CNA-HMMer with proper training.

Fig. 16.3 Predicted breakpoints by CRF-CNV (bottom) versus true breakpoints (top) on the two cell lines A) GM01535 and B) GM07081. R, gene expression level in the reference sample; T, gene expression level in the testing sample. Reproduced in color in the color plate section.


As a comparison measure, the number of segments is a rough index because it carries no information about breakpoint locations. To examine how accurately each approach predicts breakpoints, we pooled all the breakpoints from all the samples and used the F-measure to compare the three algorithms. Notice that, although exact matches are possible, predicted breakpoints may shift by a few clones around boundaries given noisy input data. We therefore used a match extent index D to allow some flexibility in matching predicted breakpoints to those given by the gold standard. Table 16.2 shows F-measures at different match extent values for CRF-CNV, CNA-HMMer, and CBS. Clearly, CBS has the worst performance regardless of the match extent value, which partially reflects its many false positives. CNA-HMMer is quite accurate even when no match extent is allowed, and its F-measure increases modestly as D grows from 0 to 1. CRF-CNV lies in between when D is equal to 0; however, its performance improves greatly when D is equal to 1, and it finally outperforms CNA-HMMer when D (p.424) is greater than or equal to 2. The primary reason CRF-CNV shifts many breakpoints by one or a few positions is the automatic median smoothing step; in contrast, CNA-HMMer directly models outliers using prior distributions.
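A match extent index like D can be implemented, for example, as a greedy one-to-one matching of predicted breakpoints to true breakpoints within D clones. This sketch is an assumption about the matching procedure, not the authors' exact code:

```python
def count_matches(true_bps, pred_bps, d):
    """Count true positives: a prediction matches if it lies within d clones
    of a still-unmatched true breakpoint (each true breakpoint used once)."""
    unmatched = sorted(true_bps)
    n_tp = 0
    for bp in sorted(pred_bps):
        for t in unmatched:
            if abs(bp - t) <= d:
                unmatched.remove(t)
                n_tp += 1
                break
    return n_tp
```

For example, with true breakpoints {10, 50, 90} and predictions {11, 50, 70}, D = 0 yields one match, while D = 1 also credits the prediction shifted by one clone, yielding two.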

Table 16.2 Comparison of F-measures with different match extent values for three algorithms.

| Method/Match Extent | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| CRF-CNV | 0.638 | 0.914 | 0.948 | 0.967 | 0.967 |
| CNA-HMMer | 0.877 | 0.947 | 0.947 | 0.947 | 0.947 |
| CBS | 0.333 | 0.500 | 0.519 | 0.519 | 0.519 |

## 16.4.2 Simulated Data

Though the results on the real data show that CRF-CNV performs better than CBS and is comparable to CNA-HMMer when the match extent index is relaxed to 2, the experiment is limited because the data set is very small. To further evaluate the performance of CRF-CNV, we tested the three algorithms on a simulated data set obtained from [24]. The data set consists of 500 samples, each with 20 chromosomes, and each chromosome contains 100 clones. Each clone can be assigned one of six possible copy number states. The authors generated these samples by sampling segments from a data set of primary breast tumor tissues and used several mechanisms (e.g., the fraction of cancerous cells in each sample, the variation of intensity values given a copy number state) to control the noise level.

Table 16.3 Comparison of the numbers of segments predicted by three different approaches run on simulated data.

| Method/Data | Group 2 | Group 3 |
| --- | --- | --- |
| Gold standard | 997 | 8299 |
| CRF-CNV | 966 | 8868 |
| CNA-HMMer | 784 | 6692 |
| CBS | 867 | 7430 |

To train CRF-CNV, we divided the 500 samples into three groups as before: the training set (group 1) contained samples 1–50, the validation set (group 2) contained samples 51–100, and the test set (group 3) contained samples 101–500. We used the same grid search approach as discussed earlier to obtain the hyperparameters $\{a_j\}$, $u$, and $\sigma^2$. For each fixed set of hyperparameters, we used the CG method to obtain the parameter $\theta$. Finally, we used Viterbi's algorithm to decode the most probable hidden copy number state labels for the samples in group 3 and compared the results with CNA-HMMer and CBS. In addition, we also compared the predictions of CRF-CNV on group 2 and group 3 to see how much deterioration our model might incur on new testing data when sub-optimal parameters are inferred from a small number of samples. For easy comparison, (p.425) the results from CBS and CNA-HMMer are presented separately for these two groups. We also used group 1 as training data to assign informative priors for CNA-HMMer.
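The Viterbi decoding step can be sketched as follows for a generic linear-chain model, using the same unary/pairwise score convention assumed for the forward recursion earlier; this is an illustration, not CRF-CNV's implementation:

```python
def viterbi(node_scores, edge_scores):
    """Return the highest-scoring state sequence for a linear chain.
    node_scores[i][y]: unary score; edge_scores[y][y2]: transition score."""
    n, s = len(node_scores), len(node_scores[0])
    delta = list(node_scores[0])      # best score of any path ending in state y
    back = []                         # backpointers, one list per position > 0
    for i in range(1, n):
        new_delta, ptrs = [], []
        for y2 in range(s):
            best_y = max(range(s), key=lambda y: delta[y] + edge_scores[y][y2])
            new_delta.append(delta[best_y] + edge_scores[best_y][y2]
                             + node_scores[i][y2])
            ptrs.append(best_y)
        delta, back = new_delta, back + [ptrs]
    y = max(range(s), key=lambda k: delta[k])   # best final state
    path = [y]
    for ptrs in reversed(back):                 # follow backpointers
        y = ptrs[y]
        path.append(y)
    return path[::-1]
```

With two states, zero transition scores, and unary scores favoring the sequence 0, 1, 0, the decoder returns exactly that labeling.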

Table 16.3 shows the total numbers of segments in group 2 and group 3 predicted by CRF-CNV, CBS, and CNA-HMMer, respectively, and compares them with the known totals. Interestingly, on this simulated data both CBS and CNA-HMMer predicted fewer segments than the gold standard, while CRF-CNV predicted fewer segments on group 2 and more on group 3. However, the number of segments does not give the whole picture, so we also examined the accuracy of boundary prediction by each method using the F-measure for both groups. Table 16.4 shows the F-measures for the different methods, groups, and match extents. As expected, the F-measure increases as D increases from 0 to 4 for all methods and both data groups. It is also not surprising that the results of CBS on group 2 and group 3 are consistent, as are those of CNA-HMMer. Interestingly, the performance of CRF-CNV on group 3 is also very close to its performance on group 2. This property is desirable because it illustrates the robustness of CRF-CNV: the performance on new testing data was almost the same as on the validation data, which were used to select the optimal model parameters. Notice that the training and validation sets are also very small, so one can expect that, with a small training set, our approach can reliably predict other data generated under the same experimental conditions. In terms of the relative performance of the three approaches, for exact matches CNA-HMMer is more accurate than CRF-CNV, and CBS is the worst. However, when we relax the matching criterion by increasing D, both CBS and CRF-CNV achieve better performance than CNA-HMMer. The results of CNA-HMMer and CRF-CNV are consistent with those from the real data, whereas CBS shows significantly higher performance than observed on the real data. This might be attributed to the simulation process, because CBS itself was used to segment the 145 samples of the primary breast tumor data set [24].

Table 16.4 Comparison of F-measures with different match extent values for three algorithms and different test sets.

| Data | Method/Match Extent | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- | --- |
| Group 2 | CRF-CNV | 0.590 | 0.792 | 0.875 | 0.900 | 0.906 |
| Group 2 | CNA-HMMer | 0.702 | 0.801 | 0.832 | 0.852 | 0.855 |
| Group 2 | CBS | 0.436 | 0.850 | 0.885 | 0.900 | 0.909 |
| Group 3 | CRF-CNV | 0.568 | 0.786 | 0.864 | 0.889 | 0.896 |
| Group 3 | CNA-HMMer | 0.697 | 0.805 | 0.840 | 0.858 | 0.869 |
| Group 3 | CBS | 0.436 | 0.847 | 0.893 | 0.911 | 0.918 |

# 16.5 Conclusion

The problem of detecting copy number variations has drawn much attention in recent years, and many computational approaches have been proposed. Among them, CBS has gained much popularity and has been shown to generally (p.426) perform better than other algorithms on simulated data [24]. However, as shown in the original paper (and confirmed by our experiments), CBS yields many false positives on the standard Coriell data set, whose copy number changes were identified by spectral karyotyping [18]. Another commonly used technique for segmentation is the HMM. HMM approaches have the advantage of performing parameter estimation (i.e., of means and variances) and copy number decoding within one framework, and their performance is expected to improve with more observations. However, almost all HMMs dedicated to aCGH are first-order Markov models and thus cannot incorporate long-range spatial correlations within the data.

We have presented a novel computational model based on the theory of conditional random fields. Our model provides great flexibility in defining meaningful feature functions over an arbitrary neighborhood. Therefore, more local information can be incorporated, and more robust results are expected. In this chapter, we have defined feature functions using robust sufficient statistics that can effectively incorporate smoothing into our unified framework. We have also developed effective forward–backward algorithms within the conjugate gradient method for efficient computation of model parameters. We evaluated our approach using real data as well as simulated data, and the results show that our approach performed better than a Bayesian HMM on both data sets when a small shift is allowed while mapping breakpoints. When compared with CBS, our approach has fewer false positives with the real data set. With the simulated data set, the performance of our approach is comparable to that of CBS, which has been shown to be the best among the three popular segmentation approaches.

Like any other CRF, our model has to rely on some training data. To be practically useful, Bayesian HMMs such as CNA-HMMer also need training data for the proper assignment of informative priors. We argue that this problem is not as serious as it initially appears, for two reasons. First, as illustrated in our experiments, our algorithm is very robust and performs consistently even when the estimates of the model parameters are not optimal. For example, we used a simplified procedure in the analysis of the simulated data set by picking a single subset for training. In theory, parameters estimated from such a procedure might depend heavily on that particular subset and need not be globally optimal. However, the results in Table 16.4 demonstrate that the performance on new testing data is almost the same as on the validation data, which were used to tune the parameters. A 10-fold cross-validation is currently being performed to further verify this observation. Furthermore, the training size required by our algorithm is very small, as illustrated by both the real and the simulated data. Our algorithm reported no false positives in segment numbers for the Coriell data mainly because we learned the structure within the data by training; the drastic difference in performance of CNA-HMMer with and without training data also supports this observation. Second, experimentally validated data (such as the Coriell data set) do exist, and one can expect more of them, because for any prediction approach some predictions eventually have to be verified experimentally. In addition, the robustness of our approach suggests that we most likely only need to train our algorithm once for each specific platform; the parameters can then be used for future data generated on the same platform.

In terms of computational cost, CRF-CNV has two separate parts: training and prediction. Training requires intensive computation to optimize the log-likelihood and to determine the hyperparameters; k-fold cross-validation, if performed, requires much more computational time still. In contrast, once the parameters have been estimated, the prediction phase is rather efficient. Fortunately, the training phase of our algorithm (p.427) only requires a small data set, which keeps the algorithm practically useful. Our algorithm can potentially be extended to other high-throughput technologies for detecting copy number alterations.

References

Bibliography references:

[1] N.P. Carter. Methods and strategies for analyzing copy number variation using DNA microarrays. Nature Genetics, 39(7 Suppl):S16–21, 2007.

[2] D.Y. Chiang, G. Getz, D.B. Jaffe, M.J. O’Kelly, X. Zhao, S.L. Carter, C. Russ, C. Nusbaum, M. Meyerson, and E.S. Lander. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature Methods, 6(1):99–103, 2009.

[3] E.K. Cho, J. Tchinda, J.L. Freeman, Y.J. Chung, W.W. Cai, and C. Lee. Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research, 115(3–4):262–272, 2006.

[4] S. Colella, C. Yau, J.M. Taylor, G. Mirza, H. Butler, P. Clouston, A.S. Bassett, A. Seller, C.C. Holmes, and J. Ragoussis. QuantiSNP: an objective Bayes hidden-Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Research, 35(6):2013–2025, 2007.

[5] P.H. Eilers and R.X. de Menezes. Quantile smoothing of array CGH data. Bioinformatics, 21(7):1146–1153, 2005.

[6] J.L. Freeman, G.H. Perry, L. Feuk, R. Redon, S.A. McCarroll, D.M. Altshuler, H. Aburatani, K.W. Jones, C. Tyler-Smith, M.E. Hurles, N.P. Carter, S.W. Scherer, and C. Lee. Copy number variation: new insights in genome diversity. Genome Research, 16(8):949–961, 2006.

[7] S. Guha, Y. Li, and D. Neuberg. Bayesian hidden Markov modeling of array CGH data. Journal of the American Statistical Association, 103(482):485–497, 2008.

[8] L. Hsu, S.G. Self, D. Grove, T. Randolph, K. Wang, J.J. Delrow, L. Loo, and P. Porter. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6(2):211–226, 2005.

[9] P. Hupe, N. Stransky, J.P. Thiery, F. Radvanyi, and E. Barillot. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20(18):3413–3422, 2004.

[10] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Eighteenth International Conference on Machine Learning, pages 282–289. Morgan Kaufmann Publishers, 2001.

[11] W.R. Lai, M.D. Johnson, R. Kucherlapati, and P.J. Park. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21(19):3763–3770, 2005.

[12] A.B. Olshen, E.S. Venkatraman, R. Lucito, and M. Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572, 2004.

[13] D. Pinkel, R. Seagraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W.L. Kuo, C. Chen, Y. Zhai, S.H. Dairkee, B.M. Ljung, J.W. Gray, and D.G. Albertson. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics, 20(2):207–211, 1998.

[14] R. Redon, S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, H. Fiegler, M.H. Shapero, A.R. Carson, W. Chen, E.K. Cho, S. Dallaire, J.L. Freeman, J.R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J.R. MacDonald, C.R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M.J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, L. Armengol, D.F. Conrad, X. Estivill, C. Tyler-Smith, N.P. Carter, H. Aburatani, C. Lee, K.W. Jones, S.W. Scherer, and M.E. Hurles. Global variation in copy number in the human genome. Nature, 444(7118):444–454, 2006.

[15] (p.428) S.L. Scott. Bayesian methods for Hidden Markov Models: recursive computing in the 21st century. Journal of the American Statistical Association, 97(457):337–351, 2002.

[16] S.P. Shah, X. Xuan, R.J. DeLeeuw, M. Khojasteh, W.L. Lam, R. Ng, and K.P. Murphy. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics, 22(14):e431–e439, 2006.

[17] F. Shen, J. Huang, K.R. Fitch, V.B. Truong, A. Kirby, W. Chen, J. Zhang, G. Liu, S.A. McCarroll, K.W. Jones, and M.H. Shapero. Improved detection of global copy number variation using high density, non-polymorphic oligonucleotide probes. BMC Genetics, 9(27):1471–2156, 2008.

[18] A.M. Snijders, N. Nowak, R. Segraves, S. Blackwood, N. Brown, J. Conroy, G. Hamilton, A.K. Hindle, B. Huey, K. Kimura, S. Law, K. Myambo, J. Palmer, B. Ylstra, J.P. Yue, J.W. Gray, A.N. Jain, D. Pinkel, and D.G. Albertson. Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics, 29(3):263–264, 2001.

[19] J.A. Snyman. Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-based Algorithms. Springer, New York, 2005.

[20] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner, T. Cremer, and P. Lichter. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes, Chromosomes and Cancer, 20(4):399–407, 1997.

[21] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning, chapter 93:142–146, 2007.

[22] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In A. Darwiche and N. Friedman, editors, Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI-02), pages 485–492. Morgan Kaufmann Publishers, 2002.

[23] K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner, S.F. Grant, H. Hakonarson, and M. Bucan. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research, 17(11):1665–1674, 2007.

[24] H. Willenbrock and J. Fridlyand. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21(22):4084–4091, 2005.

[25] X.L. Yin and J. Li. Detecting copy number variations from array CGH data based on a conditional random field model. Journal of Bioinformatics and Computational Biology, 8(2):295–314, 2010.

## Notes:

(1) Recall that for a vector $b=(b_1,b_2,\dots,b_n)$ of length $n$, the L2-norm of $b$ is computed as $\sqrt{b_1^2+b_2^2+\cdots+b_n^2}$.