Detection of Copy Number Variations from Array Comparative Genomic Hybridization Data Using Linearchain Conditional Random Field Models
Abstract and Keywords
Copy number variation (CNV) accounts for roughly 12% of the human genome. Besides their inherent role in cancer development, CNVs have been reported to underlie susceptibility to complex diseases. Each variation may range from around 1,000 nucleotides to less than 5 megabases. Array comparative genomic hybridization (aCGH) allows the identification of copy number alterations across genomes. The key computational challenge in analyzing CNVs using aCGH data is the detection of segment boundaries of copy number changes and the inference of the copy number state for each segment. Markov random fields and, more specifically, conditional random fields provide a unified framework for data preprocessing, segmentation, and copy number state decoding.
Keywords: copy number variation, aCGH, conditional random fields
Probabilistic Graphical Models for Genetics, Genomics, and Postgenomics. First Edition. Christine Sinoquet & Raphaël Mourad (Eds). © Oxford University Press 2014. Published in 2014 by Oxford University Press.
To gain more understanding of the role of inheritable copy number polymorphisms (CNPs) in determining disease phenotypes, systematic mapping and cataloging of CNPs are being carried out. Array comparative genomic hybridization (aCGH) allows identification of copy number alterations across genomes. The key computational challenge in analyzing copy number variations (CNVs) using aCGH data is the detection of segment boundaries of copy number changes and the inference of the copy number state for each segment. This chapter proposes a novel undirected graphical model based on conditional random fields for CNV detection in aCGH data. The model effectively combines data preprocessing, segmentation, and copy number state decoding into one unified framework. The approach described, termed CRFCNV, provides great flexibility in defining meaningful feature functions. Therefore, it can effectively integrate local spatial information of arbitrary sizes into the model. For model parameter estimation, the conjugate gradient (CG) method has been adopted to optimize the likelihood, and efficient forward/backward algorithms have been developed within the CG framework. The final step of copy number decoding is based on the Viterbi algorithm. The method was evaluated using real data as well as data simulated with realistic assumptions. Finally, the method is compared with two popular, publicly available programs.
16.1 Introduction
Structural variations in DNA sequences such as inheritable copy number alterations have been reported to be associated with numerous diseases. It has also been observed that somatic chromosomal aberrations (i.e., amplifications and deletions) in tumor samples are associated with different clinical or pathological features in different cancer types or subtypes [3, 6, 14]. To gain more understanding of the role of inheritable copy number polymorphisms (CNPs) in determining disease phenotypes, systematic mapping and cataloging of CNPs are needed and are being carried out. Identification of somatic copy number aberrations in cancer samples may lead to the discovery of important oncogenes or tumor suppressor genes. With the remarkable capacity of existing technologies for assessing copy number variants (CNVs), there has recently been a great wave of interest from the research community in investigating inheritable as well as somatic CNVs [1, 3, 4, 6, 14, 16, 23].
Broadly speaking, there are essentially three technological platforms for CNV detection: array-based technology (including array comparative genomic hybridization (aCGH) [13, 20], as well as many other variants), single nucleotide polymorphism (SNP) genotyping technology [1, 14], and next-generation sequencing technology [2]. Array-based technology measures DNA copy number alterations in a disease/test sample relative to a normal/reference sample. In principle, for each gene (or genomic segment), each individual inherits one copy from the father and one copy from the mother, so the total number of copies is two. However, when there are copy number mutations in an individual, the total number of copies might be one (i.e., a deletion), or three or more (i.e., an amplification/insertion). For each DNA fragment or "clone," aCGH can indirectly measure the number of copies through fluorescence intensities, which represent the abundance of DNA hybridized to the segment. The intensity ratio of the test and the reference sample, measured in log scale, is expected to be proportional to the copy number change of the test sample relative to the reference sample (assumed to have two copies), though significant noise can be introduced from various sources during the process. Array-based technologies are primarily suited to large segments of amplification and deletion, though different experimental platforms and designs using clones of different sizes may give very different resolutions and genome coverage [14]. The other two platforms, SNP genotyping and next-generation sequencing, use different techniques to measure copy number changes and have gained popularity in recent years [2, 17]. We primarily focus on aCGH data analysis in this chapter.
Different platforms have different challenges. Naturally, one should use different approaches for these platforms, taking advantage of the special properties of different data sets. Not surprisingly, various algorithms have been proposed for different data in recent years. On the other hand, the primary goal of all such studies is to identify contiguous sets of segments that share the same copy numbers. Therefore, the essential computational task for processing data from different platforms is the same: segment the genome into discrete regions of the same copy number. Once this segmentation step is done, each segment is assigned a copy number; the latter task is called "classification". One important commonality in performing segmentation for data from different platforms is the spatial correlation among clones. Many existing approaches have taken advantage of this property by using the same methodology, Hidden Markov Models (HMMs), which can conveniently model spatial dependence using a chain structure. Results have shown initial success for HMMs [1, 4, 16, 23]. However, there is an inherent limitation in all these HMMs: they are all first-order HMMs and cannot take into consideration long-range dependence.
It has been shown that, in a variety of domains, conditional random fields (CRFs) consistently outperform HMMs, mainly because CRFs can potentially integrate all the information in the data [21]. This property makes CRFs particularly appealing for modeling CNV data, since one can define feature functions using data from a region rather than the one or two data points used for emissions and transitions in HMMs. We have proposed a novel undirected graphical model based on CRFs for CNV detection in aCGH data [25]. Our model effectively combines data preprocessing, segmentation, and copy number state decoding into one unified framework. Our approach (termed CRFCNV) provides great flexibility in defining meaningful feature functions; therefore, it can effectively integrate local spatial information of arbitrary sizes into the model. For model parameter estimation, we have adopted the conjugate gradient (CG) method for likelihood optimization and developed efficient forward/backward algorithms within the CG framework. The final step of copy number decoding is based on the Viterbi algorithm. The method is evaluated using real data with known copy numbers as well as simulated data with realistic assumptions, and compared with two popular publicly available programs. Experimental results have demonstrated that CRFCNV outperforms a Bayesian Hidden Markov Model-based approach on both data sets in terms of copy number assignments. When compared to a nonparametric approach, CRFCNV achieves much greater precision while maintaining the same level of recall on the real data, and the performances of the two methods are comparable when applied to simulated data.
The remainder of this chapter is organized as follows. In Section 16.2, we give a brief overview of aCGH data and existing approaches for detecting CNVs from aCGH data. Details about the CRF model development for CNVs and its implementation are provided in Section 16.3. Our experimental results on two data sets and comparisons with other programs are presented in Section 16.4. We conclude the chapter with a discussion in Section 16.5.
16.2 aCGH Data and Analysis
16.2.1 aCGH Data
Though, theoretically, our approach can be applied to data from different experimental platforms, we focus primarily on aCGH data in this analysis. Mathematically, aCGH data usually consist of an array of log_{2} intensity ratios for a set of clones, as well as the physical position of each clone along a genome. Here, each clone is one data point in the array, and all the clones are ordered according to their physical positions on the genome. Fig. 16.1 (page 412) plots the normalized log_{2} ratio of a human cancer cell line (i.e., sample) and normal reference DNA analyzed by Snijders et al. [18]. Each data point represents one genome segment/clone, and the y-axis represents the normalized log_{2} intensity ratio. The primary goal in CNV detection based on aCGH is to segment a genome into discrete regions that share the same mean log_{2} ratio pattern (i.e., have the same copy numbers). In a normal human sample, the copy number is usually two for all the autosomes. Ideally, the log_{2} ratio of a clone should be 0 if a test sample (e.g., a cancer cell line) also has two copies of DNA, and the value should be 0.585 (or −1) if it has a single-copy gain (or single-copy loss), etc. The dotted lines in Fig. 16.1, from bottom to top, show the three values for one, two, and three copies, respectively. A straightforward approach is to use global thresholds to assign the number of copies to each clone based on its log_{2} ratio. However, as shown in Fig. 16.1, aCGH data can be quite noisy, with vague boundaries between different segments. It may also have a complex local spatial dependence structure. These properties make the segmentation problem intrinsically hard. Approaches using global thresholds generally do not work well in practice.
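The expected log_{2} ratios quoted above follow directly from log_{2}(c/2) for a diploid reference. The short Python sketch below (the `naive_call` helper is our own illustration, not code from the chapter) shows the global-threshold baseline that the text argues is too fragile for noisy aCGH data:

```python
import math

# Expected log2(test/reference) ratio for an integer copy number c,
# assuming a diploid (two-copy) reference: log2(c / 2).
def expected_log2_ratio(copies: int) -> float:
    return math.log2(copies / 2)

# 1 copy (single loss) -> -1.0; 2 copies -> 0.0; 3 copies (single gain) -> ~0.585
thresholds = {c: expected_log2_ratio(c) for c in (1, 2, 3)}

# A naive global-threshold classifier: assign each clone the copy number
# whose expected ratio is closest to the observed value.  This ignores the
# spatial correlation among clones, which is why it fails on noisy data.
def naive_call(log2_ratio: float) -> int:
    return min(thresholds, key=lambda c: abs(log2_ratio - thresholds[c]))
```

Any clone whose noisy ratio drifts past the midpoint between two expected values is miscalled, which is exactly the failure mode motivating the segmentation approaches that follow.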
16.2.2 Existing Algorithms
In general, a number of steps are needed to detect copy number changes from aCGH data. First, the raw log_{2} ratio data usually need some preprocessing, including normalization and smoothing. Normalization is an absolutely necessary step to alleviate systematic errors due to experimental factors. Usually the input data are normalized by taking advantage of control data sets in which there are no differences between the test and reference DNA. The purpose of normalization is to make the median or mean log_{2} ratio in the control data sets equal to 0. Smoothing is used to reduce noise due to random errors or abrupt changes. Smoothing methods generally filter the data using a sliding window, attempting to fit a curve to the data while handling abrupt changes and reducing random errors. In addition to methods based on sliding windows, several other techniques have been proposed, including quantile smoothing, wavelet smoothing, etc. (see [5, 8]).
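As a concrete illustration of sliding-window smoothing, the following Python sketch implements a simple median filter (the function name and window convention are ours, not the chapter's); a median filter suppresses single-clone spikes while largely preserving the step changes that mark segment boundaries:

```python
def median_smooth(x, half_width=2):
    """Sliding-window median filter: each clone's value is replaced by the
    median of its window; the window is truncated at the chromosome ends."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        window = sorted(x[lo:hi])
        m = len(window)
        med = window[m // 2] if m % 2 else 0.5 * (window[m // 2 - 1] + window[m // 2])
        out.append(med)
    return out

# A single-clone spike (1.2) is removed, while the step up to ~0.59
# (a plausible single-copy gain) survives.
noisy = [0.0, 0.05, 1.2, -0.02, 0.01, 0.59, 0.6, 0.58]
smoothed = median_smooth(noisy, half_width=1)
```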
The second step in analyzing aCGH data is referred to as segmentation, and it aims to identify segments that share the same mean log_{2} ratio. Broadly speaking, there are two related estimation problems. One is to infer the number and statistical significance of the alterations; the other is to locate their boundaries accurately. A few different algorithms have been proposed to solve these two estimation problems. Olshen et al. [12] have proposed a nonparametric approach based on the recursive circular binary segmentation (CBS) algorithm. Hupe et al. [9] have proposed an approach called GLAD, based on a median absolute deviation model, to separate outliers (observations that are numerically distant from the rest of the data) from their surrounding segments. Willenbrock and Fridlyand [24] have compared the performance of CBS (implemented in DNACopy) and GLAD using a realistic simulation model and concluded that CBS is in general better than GLAD. We have adopted the simulation model from [24] in our experimental study. After obtaining the segmentation outcomes, a postprocessing step is needed to combine segments with similar mean levels and to classify them as single-copy gain, single-copy loss, normal, multiple gains, etc. Methods such as GLADMerge [9] and MergeLevels [24] can postprocess the segmentation results and label them accordingly.
As noted by Willenbrock and Fridlyand [24], it is more desirable to perform segmentation and classification simultaneously. An easy way to merge these two steps is to use a linear chain HMM. The underlying hidden states are the real copy numbers. Given a state, the log_{2} ratio can be modeled using a Gaussian distribution. The transition from one state to another state reveals the likelihood of copy number changes between adjacent clones. Given observed data, standard algorithms (forward/backward, Baum–Welch and Viterbi) can be used to estimate parameters and to decode hidden states. A few variants of HMMs have been proposed for aCGH data in recent years [7, 16]. Lai et al. [11] have shown that HMMs performed the best for small aberrations given a sufficient signal/noise ratio. Guha et al. [7] have proposed a Bayesian HMM which can impose biological meaningful priors on the parameters. Shah et al. [16] have extended this Bayesian HMM by incorporating outliers and locationspecific priors, which can be used to model inheritable copy number polymorphisms.
Notice that all these models are first-order HMMs, which cannot capture long-range dependence. Intuitively, it makes sense to consider higher-order HMMs to capture informative local correlation, which is an important property observed in aCGH data. However, considering higher orders makes HMMs more complex and computationally intensive.
16.3 Linear-chain CRF Model for aCGH Data
To overcome the limitations of HMMs, we propose a new model based on the theory of linear-chain conditional random fields (CRFs) [10, 21]. CRFs are undirected graphical models designed for calculating the conditional distribution of output random variables Y given input variables X. The term "random field," which is equivalent to "Markov network" in the graphical model literature, has been widely used in statistical physics and computer vision. CRFs are also known as conditional Markov networks [22]. Because CRFs are conditional models, dependencies among the variables X do not need to be explicitly specified. Therefore, one can define meaningful feature functions that effectively capture local spatial dependence using the input variables X. CRFs have been widely applied to language processing, computer vision, and bioinformatics, with remarkable performance when compared with directed graphical models, including HMMs.
In general, a linear-chain CRF (see Fig. 16.2, page 414) is defined as the conditional distribution

$$\mathbb{P}_{\theta}(Y \mid X) = \frac{1}{Z_{\theta}(X)} \exp\left\{\sum_{i=1}^{n-1}\sum_{j=1}^{S(i)} \theta_{ij}\, f_{ij}(Y_i, Y_{i+1}, \tilde{X}_i)\right\},$$

where the partition function

$$Z_{\theta}(X) = \sum_{Y} \exp\left\{\sum_{i=1}^{n-1}\sum_{j=1}^{S(i)} \theta_{ij}\, f_{ij}(Y_i, Y_{i+1}, \tilde{X}_i)\right\}.$$
Here $\theta = \{\theta_{ij}\}$ are the parameters and $\{f_{ij}\}$ are feature functions. $\tilde{X}_i$ is a neighbor set of $X_i$ that is needed for computing the features relating clones i and i + 1. $S(i)$ is the total number of feature functions related to $Y_i$ and $Y_{i+1}$. The variables connected by each square node in Fig. 16.2 are used to define potential feature functions. Each feature function operates on a pair of adjacent label variables $Y_i$ and $Y_{i+1}$ and a neighbor set of $X_i$.
We define the linear-chain CRF model for aCGH data for one sample (actually one chromosome in one sample, because different chromosomes can be treated independently) as follows. The general model for a set of observations can be easily obtained by assuming independence between observations. Let $X = (X_1, \dots, X_n)$ denote the normalized log_{2} ratio intensities along one chromosome for a sample/cell line, where $X_i$ is the log_{2} ratio for clone i. One can assume that these n clones are sequentially positioned on a chromosome. Let $Y = (Y_1, \dots, Y_n)$ denote the corresponding hidden copy number states, where $Y_i$ belongs to $\{1, \dots, s\}$ and s is the total number of copy number states. These states usually indicate deletion, single-copy loss, neutral, single-copy gain, two-copy gain, or multiple-copy gain. The exact number of states and their meaning need to be specified based on the specific input data. The conditional probability of Y given the observed log_{2} ratio X based on our linear-chain CRF structure is defined as

$$\mathbb{P}_{\theta}(Y \mid X) = \frac{1}{Z_{\theta}(X)} \exp\left\{\sum_{i=1}^{n}\sum_{j}\left[\lambda_j f_j(Y_i, \tilde{X}_i) + \mu_j g_j(Y_i, \tilde{X}_i)\right] + \sum_{j} \omega_j l_j(Y_1, \tilde{X}_1) + \sum_{i=1}^{n-1}\sum_{j,k} \nu_{jk}\, h_{jk}(Y_i, Y_{i+1}, \tilde{X}_{i,i+1})\right\}, \qquad (16.1)$$

where the partition function is

$$Z_{\theta}(X) = \sum_{Y} \exp\left\{\sum_{i=1}^{n}\sum_{j}\left[\lambda_j f_j(Y_i, \tilde{X}_i) + \mu_j g_j(Y_i, \tilde{X}_i)\right] + \sum_{j} \omega_j l_j(Y_1, \tilde{X}_1) + \sum_{i=1}^{n-1}\sum_{j,k} \nu_{jk}\, h_{jk}(Y_i, Y_{i+1}, \tilde{X}_{i,i+1})\right\}.$$
Here $\theta = \{\lambda_j, \mu_j, \omega_j, \nu_{jk}\}$ are parameters that need to be estimated. Functions $f_j$, $g_j$, $l_j$, and $h_{jk}$ are feature functions that need to be defined. $\tilde{X}_i(u)$ is defined as a neighbor set of $X_i$ around clone i, i.e., $\tilde{X}_i(u) = \{X_{i-u}, \dots, X_{i-1}, X_i, X_{i+1}, \dots, X_{i+u}\}$, where u is a hyperparameter that defines the dependence length. In principle, u could be defined based on physical distances between clones instead of the number of clones, which may better capture local dependence when clones are unevenly distributed on a chromosome. For simplicity, we use the number of clones here. For $i \le u$ or $i \ge n-u$, $\tilde{X}_i(u)$ is redefined by proper truncation; for example, $\tilde{X}_1(3) = \{X_1, \dots, X_4\}$. Similarly, we define $\tilde{X}_{i,i+1}(u) = \{X_{i-u}, \dots, X_{i-1}, X_i, X_{i+1}, X_{i+2}, \dots, X_{i+1+u}\}$, $\tilde{X}^{-}_{i,i+1}(u) = \{X_{i-u}, \dots, X_{i-1}, X_i\}$, and $\tilde{X}^{+}_{i,i+1}(u) = \{X_{i+1}, X_{i+2}, \dots, X_{i+1+u}\}$. The latter two sets will be used below.
As illustrated in Fig. 16.2, although we also use a chain structure in our CRF model dedicated to CNV detection, the feature functions to be defined rely on observed data from a region. We define our feature functions by integrating all the information from these neighborhood sets. Therefore, these feature functions can capture abundant local spatial dependence. The dependence length u plays a role similar to the width of the sliding window in smoothing methods. In addition, by using a CRF, we can effectively combine smoothing, segmentation, and classification into one unified framework. For notational simplicity, we drop the parameter u in the subsequent discussion and write $\tilde{X}_i(u)$ as $\tilde{X}_i$, $\tilde{X}_{i,i+1}(u)$ as $\tilde{X}_{i,i+1}$, and so on.
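In code, the truncated neighbor sets reduce to plain slices. The sketch below uses 0-based indexing (so the text's $\tilde{X}_1(3)$ corresponds to index 0) and hypothetical function names:

```python
def neighbor_set(x, i, u):
    """X~_i(u): the clones within u positions of clone i (0-based index),
    truncated at the chromosome boundaries as described in the text."""
    lo, hi = max(0, i - u), min(len(x), i + u + 1)
    return x[lo:hi]

def pair_neighbor_sets(x, i, u):
    """Left/right halves X~-_{i,i+1}(u) and X~+_{i,i+1}(u) used by the
    transition features: clones up to i, and clones from i+1 onward."""
    left = x[max(0, i - u): i + 1]
    right = x[i + 1: min(len(x), i + 2 + u)]
    return left, right
```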
16.3.1 Feature Functions
One important step in building our model is to define meaningful feature functions that capture critical information from the input data. Essentially, we define two types of feature functions, analogous to the emission and transition probabilities in HMMs. However, our feature functions can be of any form; therefore, our model provides much more flexibility and is able to capture long-range dependence. The emission feature functions $f_j(Y_i, \tilde{X}_i)$ and $g_j(Y_i, \tilde{X}_i)$ are defined as follows:

$$f_j(Y_i, \tilde{X}_i) = \delta(Y_i = j)\; med\,\tilde{X}_i, \qquad g_j(Y_i, \tilde{X}_i) = \delta(Y_i = j)\, (med\,\tilde{X}_i)^2,$$

where $\delta(\cdot)$ is the indicator function and $med\,\tilde{X}_i$ is defined as the median value of the set $\tilde{X}_i$. Our emission features serve two purposes. First, they act as median filters that automatically smooth the input data. More importantly, the first-order and second-order median statistics are robust counterparts of the sufficient statistics of a normal distribution.
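A minimal sketch of these emission features, assuming (as the text indicates) that each is the window median, or its square, gated by an indicator on the hidden state:

```python
import statistics

def emission_features(y_i, window, state_j):
    """Emission features for state j: the first- and second-order median
    statistics of the clone's neighborhood, active only when the hidden
    state Y_i equals j.  A sketch of the form described in the text."""
    if y_i != state_j:
        return 0.0, 0.0
    med = statistics.median(window)
    return med, med * med
```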
The transition feature function $h_{jk}(Y_i, Y_{i+1}, \tilde{X}_{i,i+1})$ and the initial feature function $l_j(Y_1, \tilde{X}_1)$ are defined as follows:

$$l_j(Y_1, \tilde{X}_1) = \delta(Y_1 = j)\, c_j(med\,\tilde{X}_1), \qquad h_{jk}(Y_i, Y_{i+1}, \tilde{X}_{i,i+1}) = \delta(Y_i = j)\, \delta(Y_{i+1} = k)\, c_j(med\,\tilde{X}^{-}_{i,i+1})\, c_k(med\,\tilde{X}^{+}_{i,i+1}),$$

where $\delta(\cdot)$ is the indicator function and

$$c_j(x) = \begin{cases} \dfrac{(a_j - a_{j-1})/2}{(a_j - a_{j-1})/2 + (a_j - x)}, & x \le a_j,\\[1ex] \dfrac{(a_{j+1} - a_j)/2}{(a_{j+1} - a_j)/2 + (x - a_j)}, & x > a_j. \end{cases}$$
Here $a_j$ denotes the mean log_{2} ratio for clones with copy number state $j$ ($j = 1, \dots, s$), while $a_0$ and $a_{s+1}$ denote the greatest lower bound of the log_{2} ratio for clones with copy number state 1 and the least upper bound of the log_{2} ratio for clones with copy number state s, respectively. Without loss of generality, we assume $a_0 < a_1 < \cdots < a_{s+1}$. We define the initial feature function $l_j(Y_1, \tilde{X}_1)$ in such a way that data from the clone set $\tilde{X}_1$ only provide information about its own state. Furthermore, when $Y_1$ is equal to j, the closer the value of $med\,\tilde{X}_1$ is to $a_j$, the higher the value of $l_j(Y_1, \tilde{X}_1)$ and the more information the data provide, resulting in a higher contribution to the parameter $\omega_j$. The function $l_j(Y_1, \tilde{X}_1)$ achieves its highest value of 1 when $med\,\tilde{X}_1$ is equal to $a_j$. The other terms (i.e., $(a_{j+1} - a_j)/2$ and $(a_j - a_{j-1})/2$) in $l_j(Y_1, \tilde{X}_1)$ are scale factors, which make $l_j(Y_1, \tilde{X}_1)$ equal to 1/2 when the value of $med\,\tilde{X}_1$ is halfway between $a_{j-1}$ and $a_j$ or halfway between $a_{j+1}$ and $a_j$ (i.e., $med\,\tilde{X}_1 = (a_{j-1} + a_j)/2$ or $med\,\tilde{X}_1 = (a_{j+1} + a_j)/2$). The transition feature function $h_{jk}(Y_i, Y_{i+1}, \tilde{X}_{i,i+1})$ is defined similarly using the clone set $\tilde{X}_{i,i+1}$. When $Y_i$ equals j and $Y_{i+1}$ equals k, the closer the value of $med\,\tilde{X}^{-}_{i,i+1}$ is to $a_j$ and the value of $med\,\tilde{X}^{+}_{i,i+1}$ is to $a_k$, the higher the value of $h_{jk}(Y_i, Y_{i+1}, \tilde{X}_{i,i+1})$ and the more information the data contribute to $\nu_{jk}$. Its scale factors serve the same purpose as those in the initial feature function.
Clearly, both types of feature functions can capture local spatial dependence over a set of adjacent clones and thus potentially provide more robust inference about the hidden copy number states.
16.3.2 Parameter Estimation
Unlike the standard algorithms for HMM training, there are significant computational challenges in efficiently and accurately estimating the parameters of CRFs. For example, the partition function $Z_{\theta}(X)$ in model 16.1 needs to be explicitly calculated in order to estimate the parameters. In contrast, no such partition function needs to be considered for HMMs. The implementation of the training algorithms for our proposed linear-chain CRF model requires sophisticated statistical and numerical algorithms. To the best of our knowledge, no existing implementation can be used to solve this problem. We propose the following algorithm for parameter estimation.
In general, given a set of training data $\mathcal{D}$ consisting of D observations, i.e., individuals $\mathcal{D}=\{({X}^{(d)},{Y}^{(d)}),d=1,\dots ,D\}$, to estimate the parameter $\theta$ ($\theta = \{\lambda_j, \mu_j, \omega_j, \nu_{jk}\}$) in model 16.1, one needs to maximize a penalized conditional log-likelihood, which is defined as follows:

$$L_{\theta} = \sum_{d=1}^{D} \log \mathbb{P}_{\theta}(Y^{(d)} \mid X^{(d)}) - \frac{\parallel \theta \parallel^2}{2\sigma^2}. \qquad (16.2)$$
Here $\parallel \theta \parallel$ is the L2-norm of $\theta$. The penalization term $\parallel \theta \parallel^2 / 2\sigma^2$ is added for regularization, to prevent overfitting. Before one can solve the optimization problem, one first has to specify an additional set of hyperparameters: the dependence length u, the mean log_{2} ratios $\{a_j, j = 0, \dots, s+1\}$, and the penalization coefficient $\sigma^2$. For each j in $\{1, \dots, s\}$, $a_j$ can be estimated directly from the training data set $\mathcal{D}$: the maximum likelihood estimate of $a_j$ is simply the mean of the log_{2} ratios of all clones with copy number state j in $\mathcal{D}$. In addition, $a_0$ and $a_{s+1}$ can be imputed using the minimum log_{2} ratio over all clones with copy number state 1 and the maximum value over all clones with copy number state s, respectively. For the dependence length u and the penalization coefficient $\sigma^2$, we rely on a grid-search approach through cross-validation, which performs an exhaustive search by partitioning the search space into small grids. More specifically, the original training set $\mathcal{D}$ is first partitioned into two sets $\mathcal{D}_1$ and $\mathcal{D}_2$. We call $\mathcal{D}_1$ the new training set and $\mathcal{D}_2$ the validation set. For a given range of (discrete) parameter values of u and $\sigma^2$, we train the model on $\mathcal{D}_1$ and obtain estimates of $\theta$ for each fixed pair $(u_0, \sigma_0^2)$. The exact procedure to estimate $\theta$ given $(u_0, \sigma_0^2)$ is discussed shortly. We then apply the trained model with the estimated parameters to the validation set $\mathcal{D}_2$ and record the prediction error under the current model.
The model with the smallest prediction error as well as the associated parameters ($u,{\mathrm{\sigma}}^{2},\mathrm{\theta}$) will be kept. The prediction error is defined as the mean absolute error for all samples in the validation set ${\mathcal{D}}_{2}$. The absolute error for a clone i is defined as $\mid {Y}_{i}{\stackrel{\u02c6}{Y}}_{i}\mid $, where Y_{i} is the known copy number and ${\stackrel{\u02c6}{Y}}_{i}$ is the predicted copy number. This measure not only captures whether a prediction is exactly the same as the real copy number but also reflects how close these two numbers are.
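The validation-set criterion can be written in a few lines; this is a straightforward transcription of the mean-absolute-error definition above, not code from the CRFCNV package:

```python
def prediction_error(true_states, predicted_states):
    """Mean absolute error over clone-level copy number states, as used to
    score candidate (u, sigma^2) pairs on the validation set.  It rewards
    predictions that are close to, not just exactly equal to, the truth."""
    assert len(true_states) == len(predicted_states)
    n = len(true_states)
    return sum(abs(t - p) for t, p in zip(true_states, predicted_states)) / n
```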
For a given set of hyperparameters $\{a_j\}$, u, and $\sigma^2$, the optimization of $L_{\theta}$ in equation 16.2 can be solved using gradient-based numerical optimization methods [19]. We choose the nonlinear conjugate gradient (CG) method in our implementation, which generalizes the CG method to nonlinear optimization. To find a local optimum of a nonlinear function, the nonlinear CG method only needs to calculate the gradient, i.e., the first derivatives of $L_{\theta}$ in our problem. The outline of our nonlinear CG method is as follows. Given the objective function $L_{\theta}$, we first search in the gradient direction

$$d^{(0)} = \frac{\partial L_{\theta}}{\partial \theta}\Big|_{\theta = \theta^{(0)}},$$

where $\frac{\partial L_{\theta}}{\partial \theta}$ is the first derivative function of $L_{\theta}$ and $\theta^{(0)}$ is the initial value of $\theta$. Then, we perform a line search in this direction until $L_{\theta}$ reaches its maximum and update the parameter $\theta$ accordingly:

$$\alpha^{(0)} = \arg\max_{\alpha} L\left(\theta^{(0)} + \alpha d^{(0)}\right), \qquad \theta^{(1)} = \theta^{(0)} + \alpha^{(0)} d^{(0)}.$$
After this first iteration in the gradient direction ${d}^{(0)}$, the following steps constitute one iteration of moving along a subsequent conjugate direction d^{(i)} and updating θ accordingly:
1. Calculate the gradient direction ${\mathrm{\xi}}^{(i)}=\frac{\mathrm{\partial}{L}_{\mathrm{\theta}}}{\mathrm{\partial}\mathrm{\theta}}{\mid}_{\mathrm{\theta}={\mathrm{\theta}}^{(i)}}$,

2. compute $\beta^{(i)} = \frac{(\xi^{(i)})^{\prime} \xi^{(i)}}{(\xi^{(i-1)})^{\prime} \xi^{(i-1)}}$,

3. update the conjugate direction $d^{(i)} = \xi^{(i)} + \beta^{(i)} d^{(i-1)}$,

4. perform a line search: optimize $\alpha^{(i)} = \arg\max_{\alpha} L\left(\theta^{(i)} + \alpha d^{(i)}\right)$,

5. update ${\mathrm{\theta}}^{(i+1)}={\mathrm{\theta}}^{(i)}+{\mathrm{\alpha}}^{(i)}{d}^{(i)}$.
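A toy Python sketch of the Fletcher–Reeves iteration above, maximizing a simple concave quadratic. The crude grid-based line search and all names are our simplification; an implementation for the CRF likelihood would plug in $L_\theta$ and its derivatives:

```python
def line_search(f, theta, d,
                alphas=(1.0, 0.5, 0.25, 0.1, 0.05, 0.01, 0.001, 0.0)):
    """Crude line search: try a fixed grid of step sizes and keep the best.
    (A real implementation would use a proper 1-D maximizer.)"""
    return max(alphas, key=lambda a: f([t + a * di for t, di in zip(theta, d)]))

def cg_maximize(f, grad, theta0, n_iters=100):
    """Fletcher-Reeves nonlinear CG for maximization, following steps 1-5."""
    theta = list(theta0)
    xi_prev, d = None, None
    for _ in range(n_iters):
        xi = grad(theta)                                     # step 1: gradient
        if d is None:
            d = xi[:]                                        # first direction: the gradient
        else:
            den = sum(g * g for g in xi_prev)
            beta = (sum(g * g for g in xi) / den) if den > 0 else 0.0  # step 2
            d = [g + beta * di for g, di in zip(xi, d)]      # step 3: conjugate direction
        alpha = line_search(f, theta, d)                     # step 4: line search
        theta = [t + alpha * di for t, di in zip(theta, d)]  # step 5: update
        xi_prev = xi
    return theta

# Toy concave objective with its maximum at theta = (1, -2).
f = lambda th: -((th[0] - 1) ** 2 + 2 * (th[1] + 2) ** 2)
g = lambda th: [-2 * (th[0] - 1), -4 * (th[1] + 2)]
theta_hat = cg_maximize(f, g, [0.0, 0.0])
```

Because the grid includes a zero step, the objective is non-decreasing across iterations, mirroring the "run until convergence" behavior described next.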
The CG procedure runs until convergence. The final $\theta$ is regarded as the optimal parameter. The first-order derivatives of $L_{\theta}$ with respect to $\{\lambda_j\}$, $\{\mu_j\}$, $\{\omega_j\}$, and $\{\nu_{jk}\}$ are given by

$$\frac{\partial L_{\theta}}{\partial \lambda_j} = \sum_{d=1}^{D}\sum_{i=1}^{n}\left[f_j(Y_i^{(d)}, \tilde{X}_i^{(d)}) - \sum_{y=1}^{s} \mathbb{P}(Y_i = y \mid X^{(d)})\, f_j(y, \tilde{X}_i^{(d)})\right] - \frac{\lambda_j}{\sigma^2},$$

with $\partial L_{\theta}/\partial \mu_j$ obtained by replacing $f_j$ with $g_j$, $\partial L_{\theta}/\partial \omega_j$ obtained by replacing $f_j$ with $l_j$ and restricting the inner sum to the first clone ($i = 1$), and

$$\frac{\partial L_{\theta}}{\partial \nu_{jk}} = \sum_{d=1}^{D}\sum_{i=1}^{n-1}\left[h_{jk}(Y_i^{(d)}, Y_{i+1}^{(d)}, \tilde{X}_{i,i+1}^{(d)}) - \sum_{y=1}^{s}\sum_{y^{\ast}=1}^{s} \mathbb{P}(Y_i = y, Y_{i+1} = y^{\ast} \mid X^{(d)})\, h_{jk}(y, y^{\ast}, \tilde{X}_{i,i+1}^{(d)})\right] - \frac{\nu_{jk}}{\sigma^2}.$$

Each derivative takes the familiar form of the observed feature count minus its expected count under the model, minus the regularization term.
The marginal distributions $\mathbb{P}(Y_i = y \mid X^{(d)})$, $(i = 1, \dots, n)$ and $\mathbb{P}(Y_i = y, Y_{i+1} = y^{\ast} \mid X^{(d)})$, $(i = 1, \dots, n-1)$ in the above derivatives can be efficiently computed using the forward–backward algorithms. First, define the following forward variables $\gamma_i$ and backward variables $\eta_i$ for i = 1,…,n:

$$\gamma_1(y,d) = \exp\left\{F(y, \tilde{X}_1^{(d)}) + G(y, \tilde{X}_1^{(d)}) + L(y, \tilde{X}_1^{(d)})\right\},$$

$$\gamma_i(y,d) = \sum_{y'=1}^{s} \gamma_{i-1}(y',d)\, \exp\left\{F(y, \tilde{X}_i^{(d)}) + G(y, \tilde{X}_i^{(d)}) + H(y', y, \tilde{X}_{i-1,i}^{(d)})\right\}, \quad i = 2, \dots, n,$$

$$\eta_n(y,d) = 1, \qquad \eta_i(y,d) = \sum_{y^{\ast}=1}^{s} \eta_{i+1}(y^{\ast},d)\, \exp\left\{F(y^{\ast}, \tilde{X}_{i+1}^{(d)}) + G(y^{\ast}, \tilde{X}_{i+1}^{(d)}) + H(y, y^{\ast}, \tilde{X}_{i,i+1}^{(d)})\right\}, \quad i = n-1, \dots, 1,$$
where the functions F, G, L, and H are defined from the feature functions:

$$F(y, \tilde{X}_i) = \sum_{j} \lambda_j f_j(y, \tilde{X}_i), \qquad G(y, \tilde{X}_i) = \sum_{j} \mu_j g_j(y, \tilde{X}_i),$$

$$L(y, \tilde{X}_1) = \sum_{j} \omega_j l_j(y, \tilde{X}_1), \qquad H(y, y^{\ast}, \tilde{X}_{i,i+1}) = \sum_{j,k} \nu_{jk}\, h_{jk}(y, y^{\ast}, \tilde{X}_{i,i+1}).$$
The marginal distributions can be obtained by combining the forward and backward variables:

$$\mathbb{P}(Y_i = y \mid X^{(d)}) = \frac{\gamma_i(y,d)\, \eta_i(y,d)}{Z_{\theta}(X^{(d)})}, \quad y = 1, \dots, s; \; i = 1, \dots, n, \qquad (16.3)$$

$$\mathbb{P}(Y_i = y, Y_{i+1} = y^{\ast} \mid X^{(d)}) = \frac{\gamma_i(y,d)\, \exp\left\{F(y^{\ast}, \tilde{X}_{i+1}^{(d)}) + G(y^{\ast}, \tilde{X}_{i+1}^{(d)}) + H(y, y^{\ast}, \tilde{X}_{i,i+1}^{(d)})\right\}\, \eta_{i+1}(y^{\ast},d)}{Z_{\theta}(X^{(d)})}, \qquad (16.4)$$
and ${Z}_{\mathrm{\theta}}({X}^{(d)})$ can also be efficiently computed using forward variables ${Z}_{\mathrm{\theta}}({X}^{(d)})=\sum _{y=1}^{s}{\mathrm{\gamma}}_{n}(y,d).$
We noticed that the direct calculation of the forward variables $\gamma_i$ and backward variables $\eta_i$ is not numerically stable: it may suffer from overflow due to the repeated products of exponentials. To achieve numerical stability, we apply a log transformation to the recursions. Similar techniques have been used in HMMs [15]. To calculate $\gamma_i(y,d)$ according to the formula defined earlier, first let $y_0 = \arg\max_y \log \gamma_{i-1}(y,d)$. Each term in the recursive formula of $\gamma_i(y,d)$ is then divided by $\gamma_{i-1}(y_0,d)$ before exponentiation. After the log transformation, the new recursive relationship becomes

$$\log \gamma_i(y,d) = \log \gamma_{i-1}(y_0,d) + \log \sum_{y'=1}^{s} \exp\left\{\log \gamma_{i-1}(y',d) - \log \gamma_{i-1}(y_0,d) + F(y, \tilde{X}_i^{(d)}) + G(y, \tilde{X}_i^{(d)}) + H(y', y, \tilde{X}_{i-1,i}^{(d)})\right\}.$$
The backward variables ${\mathrm{\eta}}_{i}(y,d)$ can be transformed similarly:

$$\log \eta_i(y,d) = \log \eta_{i+1}(z_0,d) + \log \sum_{y^{\ast}=1}^{s} \exp\left\{\log \eta_{i+1}(y^{\ast},d) - \log \eta_{i+1}(z_0,d) + F(y^{\ast}, \tilde{X}_{i+1}^{(d)}) + G(y^{\ast}, \tilde{X}_{i+1}^{(d)}) + H(y, y^{\ast}, \tilde{X}_{i,i+1}^{(d)})\right\},$$
where ${z}_{0}=arg{\displaystyle \underset{y}{max}log{\mathrm{\eta}}_{i+1}(y,d)}$.
The marginal distributions in formulae 16.3 and 16.4 can be calculated using the stable $\log \gamma_i$ and $\log \eta_i$. For example, $\mathbb{P}(Y_i = y \mid X^{(d)}) = \exp\{\log \gamma_i(y,d) + \log \eta_i(y,d) - \log Z_{\theta}(X^{(d)})\}$, $y = 1, \dots, s$; $i = 1, \dots, n$, where $\log Z_{\theta}(X^{(d)}) = \log \gamma_n(y_0,d) + \log\big(1 + \sum_{y \ne y_0} \exp[\log \gamma_n(y,d) - \log \gamma_n(y_0,d)]\big)$ and $y_0 = \arg\max_y \log \gamma_n(y,d)$.
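The log-space trick above is the standard log-sum-exp stabilization. A minimal self-contained version in Python (the naming is ours):

```python
import math

def logsumexp(vals):
    """Stable log(sum(exp(v))): factor out the largest term before
    exponentiating, exactly as in the transformed recursions."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# Direct evaluation of sum(exp(v)) would overflow for scores near 1000,
# while the stabilized version stays finite.
scores = [1000.0, 1001.0, 999.0]
stable = logsumexp(scores)
```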
It is important to discuss the computational cost of training. The partition function $Z_{\theta}(X)$ in the likelihood and the marginal distributions $\mathbb{P}(Y_i = y \mid X^{(d)})$ and $\mathbb{P}(Y_i = y, Y_{i+1} = y^{\ast} \mid X^{(d)})$ in the gradient can be efficiently computed using the standard forward–backward algorithms. The forward–backward algorithm required for these three computations has $O(ns^2)$ complexity, where n is the number of clones and s is the number of hidden states. However, each training observation has its own partition function and marginal distributions, so we need to run a forward–backward procedure for each observation in the training data set and for each gradient computation. Therefore, the total cost of training is $O(ns^2 dg)$, where d is the number of observations in the training data set and g is the number of gradient computations required by the optimization procedure.
For graphical-model-based approaches such as HMMs, many researchers pool data across all chromosomes from all samples, which can dramatically reduce the number of parameters needed without sacrificing much inference accuracy. We take a similar approach, which is reflected by our homogeneous linear-chain CRF structure.
16.3.3 Evaluation Methods
We implemented the proposed approach as a Matlab package termed CRFCNV and evaluated its performance using a publicly available real data set with known copy numbers [18] and a synthetic data set generated in [24]. Notice that many clones have the normal (two) copies of DNA; therefore, the number of correctly predicted state labels is not a good measure of an algorithm's performance. Instead, we compare the performance of CRFCNV with two popular programs in terms of the number of predicted segments and the accuracy of segment boundaries, referred to as breakpoints. To summarize the performance of an algorithm over multiple chromosomes and individuals, we use a single value called the F-measure, which combines precision and recall. Recall that given the true copy number state labels and the predicted labels, precision (P) is defined as $\frac{ntp}{np}$ and recall (R) is defined as $\frac{ntp}{nt}$, where ntp is the number of true positive (i.e., correctly predicted) breakpoints, np is the number of predicted breakpoints, and nt is the number of true breakpoints. The F-measure is defined as F = 2PR/(P+R), which aims to balance precision and recall. The two programs we chose are CBS [12] and CNAHMMer [16], both of which have been implemented as Matlab tools. As mentioned earlier, CBS is one of the most popular segmentation algorithms, and different groups have shown that it generally performs better than many other algorithms. CNAHMMer was chosen because we wanted to compare the performance of our CRF model with HMMs, and CNAHMMer is an implementation of a Bayesian HMM with high accuracy [16].
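The evaluation metric is easy to state in code. This sketch scores predicted breakpoint positions against the truth (function name ours), returning precision, recall, and F-measure as defined above; for simplicity it counts a predicted breakpoint as correct only on an exact position match:

```python
def breakpoint_scores(true_breakpoints, predicted_breakpoints):
    """Precision P = ntp/np, recall R = ntp/nt, and F = 2PR/(P+R),
    where ntp is the number of predicted breakpoints that exactly
    match a true breakpoint position."""
    true_set, pred_set = set(true_breakpoints), set(predicted_breakpoints)
    ntp = len(true_set & pred_set)
    if ntp == 0:
        return 0.0, 0.0, 0.0
    p, r = ntp / len(pred_set), ntp / len(true_set)
    return p, r, 2 * p * r / (p + r)
```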
16.4 Experimental Results
16.4.1 A Real Example
The Coriell data set is a well-known "gold standard" data set, originally analyzed in [18]. The data can be downloaded from http://www.nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html and have been widely used in testing new algorithms and in comparing different algorithms. The CBS algorithm was applied to this data set in the original paper [12]. The Coriell data consist of 15 cell lines, named GM03563, GM00143, …, GM01524. We simply use the numbers 1, 2, …, 15 to represent these cell lines. For this particular data set, there are only three states (s = 3), i.e., single-copy loss, normal copy, and single-copy gain. Notice that unlike CBS, CRFCNV requires training data to estimate parameters; it would be unfair to directly compare the prediction results of CRFCNV on training data with results from CBS. We took a simple approach which divides the 15 samples into three groups of 5 samples each. In the first run, we used group 1 as training data and group 2 as validation data to obtain model parameters (as discussed in Subsection 16.3.2). We then used the model to predict data in group 3 (testing data) and recorded the prediction results. In the second and third runs, we alternated the roles of groups 1–3 and obtained prediction results for samples in group 1 and group 2, respectively. Finally, we summarized our results over all 15 samples. For example, for the first run, we first obtained $\{{a}_{j},j=0,\dots ,4\}$ directly from the samples in group 1; the estimates for $\{{a}_{j}\}$ were 1.348, 0.682, 0.001, 0.497, 0.810. To estimate the penalization coefficient σ^{2} and the dependence length u, we defined the search space as $A\times B=\{0,1,2,\dots ,30\}\times \{0,1,\dots ,5\}$. For each value m in A and each value $u_0$ in B, we let ${\sigma}^{2}=400\times {0.8}^{m}$ and $u=u_0$. Essentially, to search σ^{2} over a broad range, we used a geometric decay.
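The grid search just described can be sketched as a simple double loop. Here `validation_error` is a hypothetical stand-in for training the model on group 1 with the given hyperparameters and measuring the prediction error on group 2:

```python
def grid_search(validation_error):
    """Search sigma^2 and u over A x B = {0,...,30} x {0,...,5}."""
    best = None
    for m in range(31):                 # m in A
        sigma2 = 400 * 0.8 ** m         # geometric decay spans a broad range
        for u0 in range(6):             # u0 in B
            err = validation_error(sigma2, u0)
            if best is None or err < best[0]:
                best = (err, sigma2, u0)
    return best[1], best[2]             # optimal sigma^2 and u
```

The geometric schedule lets a 31-point grid cover values of σ² from 400 down to roughly 3.7, which would require a far denser linear grid.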
The upper bound on u was set to 5 because for aCGH data such as the Coriell data set, each clone can cover quite a long range of DNA. The optimal σ^{2} and u were chosen by minimizing the prediction error on the samples in group 2 (the validation set). Our results indicated that the model with u = 1 and m = 21 achieves the lowest prediction error. The values of θ corresponding to this best model were recorded. We then applied the Viterbi algorithm to find the most probable hidden copy number states for the samples in group 3, as well as the number and boundaries of segments. Runs 2 and 3 produced the corresponding results for group 1 and group 2. For CNAHMMer, one can either use its default priors or use training data to obtain informative priors. We tested the performance of CNAHMMer both with and without informative priors.
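The Viterbi decoding step can also be sketched for a generic homogeneous linear chain. As in the earlier sketch, `node_pot` and `trans_pot` are assumed log potentials; the chapter's actual feature-based potentials are not reproduced here.

```python
import numpy as np

def viterbi(node_pot, trans_pot):
    """Most probable state path for a homogeneous linear chain, in O(n s^2).

    node_pot:  (n, s) log node potentials
    trans_pot: (s, s) log transition potentials, indexed [prev, cur]
    """
    n, s = node_pot.shape
    delta = node_pot[0].copy()
    back = np.zeros((n, s), dtype=int)     # backpointers to the best predecessor
    for i in range(1, n):
        scores = delta[:, None] + trans_pot          # (prev, cur)
        back[i] = np.argmax(scores, axis=0)
        delta = node_pot[i] + np.max(scores, axis=0)
    # Trace back from the best final state.
    path = [int(np.argmax(delta))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```

With three states (loss = 0, normal = 1, gain = 2), segment boundaries are simply the positions where the decoded path changes state.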
Table 16.1 Comparison of the number of segments identified along the genome, by three algorithms, for each sample of the Coriell data set.
Method / Sample        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  Sum
Gold standard          5   3   5   3   5   3   5   5   5   5   3   5   2   3   3   60
CRFCNV                 5   3   5   3   5   3   5   5   3   3   3   5   2   3   3   56
CBS                   17  42   7   6   9   5   5   6   5   5  13   7  17   3   9  156
CNAHMMer (default)     9  83   9   7  11   3   7   5  11  11  21   5  16  10  16  224
CNAHMMer (trained)     3   3   5   3   5   3   5   5   3   3   3   5   2   3   3   54
Table 16.1 shows the number of segments identified along the genome for each individual according to the gold standard, together with the predictions of the three algorithms CRFCNV, CBS, and CNAHMMer. The number of segments detected by CRFCNV is exactly the same as the gold standard for almost all samples (the exceptions are samples 9 and 10). Further examination of samples 9 and 10 (see Fig. 16.3) reveals that the segment we missed in sample 9 has only one clone, which has been smoothed out by our algorithm. The segment missed in sample 10 is also very short, and the signal is very weak. Our results clearly show that CBS generated many more segments compared to the ground truth, which is consistent with the results in the original paper [18]. The overall number of segments reported by CNAHMMer with default priors was even greater than the total number from CBS. On the other hand, once we used training data to properly assign informative priors for CNAHMMer, it returned almost the same number of segments as CRFCNV; the only exception is that CNAHMMer missed one breakpoint in sample 1. This illustrates that, with correctly labeled training data, both CRFCNV and CNAHMMer can effectively eliminate all false positives in this data set. In the subsequent comparisons, we therefore only report the results of CNAHMMer with proper training.
As a comparison measure, the number of segments is a very rough index because it carries no information about breakpoint locations. To further examine how accurately the breakpoints are predicted by each approach, we pooled all the breakpoints from all the samples and used the F-measure to compare the performances of the three algorithms. Notice that even though exact matches were possible, shifts of a few clones around boundaries were also likely given the noisy input data. Therefore, we used a match extent index D to allow some flexibility in matching predicted breakpoints to those given by the gold standard. Table 16.2 shows the F-measures under different match extent values for CRFCNV, CNAHMMer, and CBS. Clearly, CBS has the worst performance regardless of the match extent, which partially reflects its many false positives. The result of CNAHMMer is quite accurate even when no match extent is allowed, and its F-measure shows only a modest increase as D increases from 0 to 1. The result of CRFCNV lies in between when D is equal to 0. However, the performance of CRFCNV is greatly enhanced when D is equal to 1, and it finally outperforms CNAHMMer when D is greater than or equal to 2. The primary reason that CRFCNV shifted many breakpoints by one or a few positions is the automatic median smoothing step; in contrast, CNAHMMer directly models outliers using prior distributions.
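The breakpoint evaluation with a match extent D might be implemented as follows. The greedy one-to-one matching is an assumption on our part, since the chapter does not spell out how multiple nearby predictions are resolved against a single true breakpoint.

```python
def breakpoint_fmeasure(true_bps, pred_bps, D=0):
    """Precision, recall, and F-measure for predicted breakpoints.

    A prediction counts as a true positive if it lies within D clones of a
    still-unmatched true breakpoint (greedy nearest-first matching).
    """
    unmatched = list(true_bps)
    ntp = 0
    for b in pred_bps:
        hits = [t for t in unmatched if abs(t - b) <= D]
        if hits:
            unmatched.remove(min(hits, key=lambda t: abs(t - b)))
            ntp += 1
    P = ntp / len(pred_bps) if pred_bps else 0.0
    R = ntp / len(true_bps) if true_bps else 0.0
    F = 2 * P * R / (P + R) if P + R > 0 else 0.0
    return P, R, F
```

For example, a prediction shifted by two clones is a miss at D = 0 but a hit at D = 2, which is exactly the effect visible for CRFCNV in Table 16.2.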
Table 16.2 Comparison of Fmeasures with different match extent values for three algorithms.
Method / Match extent      0      1      2      3      4
CRFCNV                 0.638  0.914  0.948  0.967  0.967
CNAHMMer               0.877  0.947  0.947  0.947  0.947
CBS                    0.333  0.500  0.519  0.519  0.519
16.4.2 Simulated Data
Though the results on real data show that CRFCNV performs better than CBS and is comparable with CNAHMMer when the match extent index is relaxed to 2, the experiment is limited because the data set is very small. To further evaluate the performance of CRFCNV, we tested the three algorithms using a simulated data set obtained from [24]. The data set consists of 500 samples, each with 20 chromosomes; each chromosome contains 100 clones, and each clone can be assigned one of six possible copy number states. The authors generated these samples by sampling segments from a data set of primary breast tumor tissues and used several mechanisms (e.g., the fraction of cancerous cells in each sample, the variation of intensity values given a copy number state) to control the noise level.
Table 16.3 Comparison of the numbers of segments predicted by three different approaches run on simulated data.
Method / Data     Group 2   Group 3
Gold standard         997      8299
CRFCNV                966      8868
CNAHMMer              784      6692
CBS                   867      7430
To train CRFCNV, we divided the 500 samples into three groups as before. This time, the training set (group 1) contained samples 1–50, the validation set (group 2) contained samples 51–100, and the test set (group 3) contained samples 101–500. We used the same grid search approach as discussed earlier to obtain the hyperparameters $\{{a}_{j}\}$, u, and σ^{2}. For each fixed set of hyperparameters, we used the CG method to obtain the parameters θ. Finally, we used the Viterbi algorithm to decode the most probable hidden copy number state labels for the samples in group 3 and compared the results with CNAHMMer and CBS. In addition, we also compared the predictions of CRFCNV on group 2 and group 3 to see how much deterioration our model might incur on new testing data given suboptimal parameters inferred from a small number of samples. For easy comparison, the results from CBS and CNAHMMer are presented separately for these two groups. We also used group 1 as training data to assign informative priors for CNAHMMer.
Table 16.3 shows the total numbers of segments in group 2 and group 3 predicted by CRFCNV, CBS, and CNAHMMer, respectively, and compares them with the known total numbers of segments. Interestingly, for this simulated data set, both CBS and CNAHMMer predicted fewer segments than the ground truth, while CRFCNV predicted fewer segments in group 2 and more segments in group 3. However, the number of segments does not provide the whole picture. We therefore examined the accuracy of boundary prediction by each method using the F-measure for both group 2 and group 3. Table 16.4 shows the F-measures for the different methods, groups, and match extents. As expected, the F-measure increases as D increases from 0 to 4 for all methods and both data groups. It is also not surprising that the results of CBS on group 2 and group 3 are consistent, as are those of CNAHMMer. Interestingly, the performance of CRFCNV on group 3 is also very close to its performance on group 2. This property is desirable because it illustrates the robustness of CRFCNV: the performance on new testing data was almost the same as the performance on the validation data, which were used to select the optimal model parameters. Notice that the training and validation data sets are also very small. One can therefore expect that, with a small training data set, our approach can reliably predict other data generated under the same experimental conditions. In terms of the relative performance of the three approaches, CNAHMMer is more accurate than CRFCNV, and CBS is the worst, in the case of exact matches. However, when we relax the matching criterion by increasing the value of D, both CBS and CRFCNV achieve better performance than CNAHMMer. The results of CNAHMMer and CRFCNV are consistent with those from the real data, whereas CBS shows significantly higher performance than observed on the real data.
However, this might be attributed to the simulation process, because CBS was used to segment the 145 samples from the primary breast tumor data set [24].
Table 16.4 Comparison of Fmeasures with different match extent values for three algorithms and different test sets.
Data      Method / Match extent      0      1      2      3      4
Group 2   CRFCNV                 0.590  0.792  0.875  0.900  0.906
          CNAHMMer               0.702  0.801  0.832  0.852  0.855
          CBS                    0.436  0.850  0.885  0.900  0.909
Group 3   CRFCNV                 0.568  0.786  0.864  0.889  0.896
          CNAHMMer               0.697  0.805  0.840  0.858  0.869
          CBS                    0.436  0.847  0.893  0.911  0.918
16.5 Conclusion
The problem of detecting copy number variations has drawn much attention in recent years, and many computational approaches have been proposed. Among these, CBS has gained much popularity, and it has been shown to generally perform better than other algorithms on simulated data [24]. However, as shown in the original paper (and confirmed by our experiments), CBS yields many more false positives on the copy number changes in the standard Coriell data set, which were identified by spectral karyotyping [18]. Another commonly used technique for segmentation is the HMM. HMM approaches have the advantage of performing parameter (i.e., mean and variance) estimation and copy number decoding within one framework, and their performance is expected to improve with more observations. However, almost all HMMs dedicated to aCGH are first-order Markov models and thus cannot incorporate long-range spatial correlations within the data.
We have presented a novel computational model based on the theory of conditional random fields. Our model provides great flexibility in defining meaningful feature functions over an arbitrary neighborhood; therefore, more local information can be incorporated, and more robust results can be expected. In this chapter, we defined feature functions using robust sufficient statistics that effectively incorporate smoothing into our unified framework. We also developed efficient forward–backward algorithms within the conjugate gradient method for the computation of model parameters. We evaluated our approach using real as well as simulated data, and the results show that it performed better than a Bayesian HMM on both data sets when a small shift is allowed in mapping breakpoints. Compared with CBS, our approach produces fewer false positives on the real data set. On the simulated data set, the performance of our approach is comparable to that of CBS, which had been shown to be the best of the three popular segmentation approaches.
Like any other CRF, our model has to rely on some training data. To be practically useful, Bayesian HMMs such as CNAHMMer also need training data for the proper assignment of informative priors. We argue that this problem is not as serious as it initially appears, primarily for two reasons. First, as illustrated in our experiments, our algorithm is very robust and performs consistently even if one cannot find optimal estimates for the model parameters. For example, we used a simplified procedure in the analysis of the simulated data set by picking one subset for training. Theoretically, parameters estimated from such a procedure might heavily depend on this particular subset and need not be globally optimal. However, the results shown in Table 16.4 demonstrate that the performance on new testing data is almost the same as on the validation data, which were used to tune the parameters. A 10-fold cross-validation is currently being performed to further verify this observation. Furthermore, the training set required by our algorithm is very small, as illustrated by both the real and the simulated data. The primary reason that our algorithm reported no false-positive segments on the Coriell data is that we learned the structure within the data by training; the drastic difference in the performance of CNAHMMer with and without training data also supports this observation. Second, there already exist some experimentally validated data sets (such as the Coriell data set), and one can expect more, because eventually any prediction approach has to have some of its predictions experimentally verified. In addition, the robustness of our approach suggests that we most likely only need to train the algorithm once for each specific platform; the parameters can then be reused for future data generated on the same platform.
In terms of computational cost, CRFCNV has two separate phases: training and prediction. Training requires intensive computation to optimize the log-likelihood and to determine the hyperparameters; performing k-fold cross-validation requires even more computational time. In contrast, once the parameters have been estimated, the prediction phase is rather efficient. Fortunately, the training phase of our algorithm only requires a small data set, which keeps the algorithm practically useful. Our algorithm can also be extended and applied to other high-throughput technologies for detecting copy number alterations.
References
Bibliography references:
[1] N.P. Carter. Methods and strategies for analyzing copy number variation using DNA microarrays. Nature Genetics, 39(7 Suppl):S16–21, 2007.
[2] D.Y. Chiang, G. Getz, D.B. Jaffe, M.J. O'Kelly, X. Zhao, S.L. Carter, C. Russ, C. Nusbaum, M. Meyerson, and E.S. Lander. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature Methods, 6(1):99–103, 2009.
[3] E.K. Cho, J. Tchinda, J.L. Freeman, Y.J. Chung, W.W. Cai, and C. Lee. Array-based comparative genomic hybridization and copy number variation in cancer research. Cytogenetic and Genome Research, 115(3–4):262–272, 2006.
[4] S. Colella, C. Yau, J.M. Taylor, G. Mirza, H. Butler, P. Clouston, A.S. Bassett, A. Seller, C.C. Holmes, and J. Ragoussis. QuantiSNP: an objective Bayes hidden Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Research, 35(6):2013–2025, 2007.
[5] P.H. Eilers and R.X. de Menezes. Quantile smoothing of array CGH data. Bioinformatics, 21(7):1146–1153, 2005.
[6] J.L. Freeman, G.H. Perry, L. Feuk, R. Redon, S.A. McCarroll, D.M. Altshuler, H. Aburatani, K.W. Jones, C. Tyler-Smith, M.E. Hurles, N.P. Carter, S.W. Scherer, and C. Lee. Copy number variation: new insights in genome diversity. Genome Research, 16(8):949–961, 2006.
[7] S. Guha, Y. Li, and D. Neuberg. Bayesian hidden Markov modeling of array CGH data. Journal of the American Statistical Association, 103(482):485–497, 2008.
[8] L. Hsu, S.G. Self, D. Grove, T. Randolph, K. Wang, J.J. Delrow, L. Loo, and P. Porter. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6(2):211–226, 2005.
[9] P. Hupe, N. Stransky, J.P. Thiery, F. Radvanyi, and E. Barillot. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20(18):3413–3422, 2004.
[10] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Eighteenth International Conference on Machine Learning, pages 282–289. Morgan Kaufmann Publishers, 2001.
[11] W.R. Lai, M.D. Johnson, R. Kucherlapati, and P.J. Park. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21(19):3763–3770, 2005.
[12] A.B. Olshen, E.S. Venkatraman, R. Lucito, and M. Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572, 2004.
[13] D. Pinkel, R. Seagraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W.L. Kuo, C. Chen, Y. Zhai, S.H. Dairkee, B.M. Ljung, J.W. Gray, and D.G. Albertson. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics, 20(2):207–211, 1998.
[14] R. Redon, S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, H. Fiegler, M.H. Shapero, A.R. Carson, W. Chen, E.K. Cho, S. Dallaire, J.L. Freeman, J.R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J.R. MacDonald, C.R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M.J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, L. Armengol, D.F. Conrad, X. Estivill, C. Tyler-Smith, N.P. Carter, H. Aburatani, C. Lee, K.W. Jones, S.W. Scherer, and M.E. Hurles. Global variation in copy number in the human genome. Nature, 444(7118):444–454, 2006.
[15] S.L. Scott. Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association, 97(457):337–351, 2002.
[16] S.P. Shah, X. Xuan, R.J. DeLeeuw, M. Khojasteh, W.L. Lam, R. Ng, and K.P. Murphy. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics, 22(14):e431–e439, 2006.
[17] F. Shen, J. Huang, K.R. Fitch, V.B. Truong, A. Kirby, W. Chen, J. Zhang, G. Liu, S.A. McCarroll, K.W. Jones, and M.H. Shapero. Improved detection of global copy number variation using high density, non-polymorphic oligonucleotide probes. BMC Genetics, 9(27):1471–2156, 2008.
[18] A.M. Snijders, N. Nowak, R. Segraves, S. Blackwood, N. Brown, J. Conroy, G. Hamilton, A.K. Hindle, B. Huey, K. Kimura, S. Law, K. Myambo, J. Palmer, B. Ylstra, J.P. Yue, J.W. Gray, A.N. Jain, D. Pinkel, and D.G. Albertson. Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics, 29(3):263–264, 2001.
[19] J.A. Snyman. Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-based Algorithms. Springer, New York, 2005.
[20] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner, T. Cremer, and P. Lichter. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes, Chromosomes and Cancer, 20(4):399–407, 1997.
[21] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning. MIT Press, 2007.
[22] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In A. Darwiche and N. Friedman, editors, Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI-02), pages 485–492. Morgan Kaufmann Publishers, 2002.
[23] K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner, S.F. Grant, H. Hakonarson, and M. Bucan. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research, 17(11):1665–1674, 2007.
[24] H. Willenbrock and J. Fridlyand. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21(22):4084–4091, 2005.
[25] X.L. Yin and J. Li. Detecting copy number variations from array CGH data based on a conditional random field model. Journal of Bioinformatics and Computational Biology, 8(2):295–314, 2010.
Notes:
(^{1}) Recall that for a vector $b=({b}_{1},{b}_{2},\dots ,{b}_{n})$ of length n, the L2-norm of b is computed as $\sqrt{{b}_{1}^{2}+{b}_{2}^{2}+\cdots +{b}_{n}^{2}}$.