
Conceived and designed the experiments: GS JM SW. Performed the experiments: JM MI. Analyzed the data: GS AA. Wrote the paper: GS AA JM.

The discovery that copy number variants (CNVs) are widespread in the human genome has motivated the development of numerous algorithms that attempt to detect CNVs from intensity data. However, all approaches are plagued by high false discovery rates. Further, because CNVs are characterized by two dimensions (length and intensity), it is unclear how to order called CNVs to prioritize experimental validation.

We developed a univariate score that correlates with the likelihood that a CNV is true. This score can be used to order CNV calls in such a way that calls having larger scores are more likely to overlap a true CNV. We developed cnv.beast, a computationally efficient algorithm for calling CNVs that uses robust backward elimination regression to keep CNV calls with scores that exceed a user-defined threshold. Using an independent dataset that was measured using a different platform, we validated our score and showed that our approach performed better than six other currently-available methods.

cnv.beast is available at

Any procedure for calling CNVs will make some false positive calls. While it may seem clear that CNV calls that are longer and/or feature a larger change in log intensity ratio (LIR) are more likely to be validated, it is not clear how to combine length and LIR information into a single measure that can be used to rank CNV calls. Optimally, such a measure would correlate with the chance that a CNV would be experimentally validated. All current methods of calling CNVs are in some way based on statistical information,

To develop a univariate measure that predicts experimental validation, we introduce a family of scores of the form
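The exact formula is not reproduced in this copy; as a sketch, assuming the score combines the absolute mean LIR with the number of probes through an exponent φ (0.5 is the default value listed in the parameter table, and the exponent fitted later in the paper is consistent with this), ranking calls by score could look like:

```python
def cnv_score(mean_lir, n_probes, phi=0.5):
    # Hypothetical form: |mean LIR| scaled by a power of the CNV's
    # probe count. The paper's exact formula is elided in this copy;
    # phi=0.5 mirrors the default listed in the parameter table.
    return abs(mean_lir) * n_probes ** phi

# Order calls so that higher scores (more likely true CNVs) come first
calls = [(0.30, 4), (0.15, 40), (0.60, 2)]  # (mean LIR, probe count)
ranked = sorted(calls, key=lambda c: cnv_score(*c), reverse=True)
```

A long, modest-intensity call can thus outrank a short, intense one, which a ranking on either dimension alone would miss.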

Because our score is chosen to correlate with the chance a CNV is validated, we wanted to make calls based on this score; in particular, we wanted a fast, easy-to-use algorithm that would call CNVs based on their score, keeping those that exceed a user-specified minimum score. To this end, we developed cnv.beast (

Although there are numerous algorithms now available for finding CNVs from either array or gene-chip data, few are based on regression. The majority are either change-point algorithms (

Three regression-based algorithms for finding CNVs are currently available: GLAD

The remainder of the paper is organized as follows. We first analyze a set of experimentally-validated calls made using Nimblegen data on the X-chromosome to determine a score function that correlates with the chance that a called CNV overlaps with a true CNV. We then develop cnv.beast, a novel backward-elimination regression algorithm that keeps CNVs having scores that exceed a user-defined threshold. Finally, we validate our approach by using data on deletions in eight HapMap samples that have been experimentally determined

We assume that the observed data comprise the log-intensity ratio (LIR) values at a series of probes having known position in the genome, either from an aCGH experiment or from quantitative intensity data from a genotyping platform (

To choose the value of

Copy number variants were called using the NimbleScan (NS) software package version 2.4, an implementation of circular binary segmentation analysis, distributed by Nimblegen. Each sub-array was analyzed separately. Data from non-unique probes, as well as data from approximately 5% of poorly-behaving probes having unusually large variance, were discarded. Spatial correction and normalization were performed using NS, then segment boundaries were determined using the default parameters (no minimum difference in LIR that segments must exhibit before they are identified as separate segments; two or more adjacent probes required to call a change in LIR; maximum stringency for selecting initial segment boundaries).

The NS package gives a list of segment boundaries; because the change in LIR across a boundary may be negligible, segment boundaries do not necessarily correspond to CNVs. We selected as CNVs those segments for which the absolute value of the mean LIR was greater than the absolute value of the mean sub-array LIR plus one standard deviation. To identify a parsimonious set of segments for validation, CNVs were merged if their endpoints were within 3 kb and if their mean LIRs had the same sign. The LIR of the merged CNV was taken to be the weighted average of the unmerged LIRs, weighted by the number of probes. Finally, to increase reliability, CNVs were only called if the probe density was greater than 9 probes/kb (
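The merging rule just described can be sketched as follows (the dict representation and field names are illustrative, not the actual pipeline's interface; the 3 kb threshold and probe-weighted averaging follow the text):

```python
def merge_cnvs(cnvs, max_gap=3000):
    """Merge adjacent CNV calls whose endpoints are within max_gap bases
    and whose mean LIRs have the same sign; the merged LIR is the
    average of the unmerged LIRs weighted by probe count.

    Each CNV is a dict with 'start', 'end', 'mean_lir', 'n_probes';
    cnvs are assumed sorted by start position.
    """
    merged = []
    for cnv in cnvs:
        if merged:
            prev = merged[-1]
            close = cnv['start'] - prev['end'] <= max_gap
            same_sign = prev['mean_lir'] * cnv['mean_lir'] > 0
            if close and same_sign:
                n = prev['n_probes'] + cnv['n_probes']
                # Probe-weighted average of the two LIRs
                lir = (prev['mean_lir'] * prev['n_probes']
                       + cnv['mean_lir'] * cnv['n_probes']) / n
                prev.update(end=cnv['end'], mean_lir=lir, n_probes=n)
                continue
        merged.append(dict(cnv))
    return merged
```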

Using NimbleScan as described above, we obtained 414 putative CNVs. Experimental determination of validation status using PCR amplification followed by gel electrophoresis was successfully completed for 111 putative CNVs called among 41 persons. For generalizability to multiple platforms, we quantile normalized the LIR data before further analysis. Based on examination of both the X-chromosome data described above and data from Affymetrix arrays (data not shown), we chose to quantile normalize to a

We fit a logistic regression model with validation status (

CNVs found in the same individual as well as overlapping CNVs found in multiple individuals were treated as independent when fitting this model. The region of

Note that for any scoring function of the form

By fitting the logistic model, we found

Triangles correspond to true positives, ellipses to false positives.

We describe cnv.beast, a novel backward elimination algorithm for regression analysis of CNV data that is based on the availability of a univariate score function for ordering CNVs. The algorithm is based on a regression model in which LIR data from an individual are regressed on a series of step functions having jumps at each probe. We treat each step function (jump) as a CNV (with length given by the distance to the nearest jump and height given by the change in predicted magnitude) so that a score for each term can be calculated. Then, the algorithm implements backward elimination until each term in the regression model has a score higher than a user-specified cutoff s^{*}, while also eliminating CNVs that contain fewer than n_{min} probes.

We use backward elimination to avoid masking. Masking occurs in forward selection algorithms when a term that would correspond to one boundary of a CNV is not entered into the model because the term that corresponds to that CNV's other boundary is not yet in the model. For example, a CNV comprising probes 80–120 would be described by two terms in

We advocate quantile normalization of the log intensity ratios even if the numerator and denominator have already been normalized, so that the same cutoffs can be used for all datasets. As described previously, we normalized to a Student's t distribution with 5 degrees of freedom, scaled so that the median absolute deviation was 0.2. We made this choice so that our cutoffs for quantile-normalized data could also be reasonably applied to untransformed data if necessary.
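This normalization can be sketched as follows (for a dependency-free example, the standard normal inverse CDF stands in for the paper's t distribution with 5 degrees of freedom; passing a t quantile function, e.g. from SciPy, recovers the paper's choice):

```python
from statistics import NormalDist, median

def quantile_normalize(values, inv_cdf=NormalDist().inv_cdf, target_mad=0.2):
    """Map each value to the target distribution's quantile at its
    mid-rank, then rescale so the median absolute deviation (MAD)
    of the result equals target_mad (0.2 in the text)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = inv_cdf((rank + 0.5) / n)  # mid-rank plotting position
    med = median(out)
    mad = median(abs(x - med) for x in out)
    return [med + (x - med) * target_mad / mad for x in out]
```

Because the mapping is monotone, the ordering of the log intensity ratios is preserved; only their marginal distribution changes.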

Let y_{i} denote the observed LIR at probe i, with probes indexed in genomic order. We model y_{i} as a step function of position: each probe j contributes a possible jump of height β_{j}, so that the predicted LIR at probe i is the sum of the jumps β_{j} over all probes j at or before i, plus an error term ε_{i}.

While it is possible to determine fit by using least squares,

When

Our backward elimination algorithm begins with the saturated model

Backward elimination is facilitated by the following observations. First, the values of

We now describe how we choose which jumps to eliminate so that only CNVs having ‘large’ scores are retained. We define the ‘gap’ between the probe t_{j} and the next retained probe, and a cutoff function of this gap that involves the parameters n_{min}, n_{max}, and µ_{min}

(10^{16}) that is much larger than the absolute value of the largest log intensity ratio in the data. The ‘default’ values of the parameters n_{min}, n_{max}, and µ_{min} are given in the table below.

Parameter | Default Value |
n_{min} | 6 |
n_{max} | 30 |
µ_{min} | 0.25 |
0.5 |

Having chosen the form of the cutoff function

note that ^{*} for which

Our backward elimination algorithm can be efficiently executed with a single pass through a sorted list of the values of
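The elimination loop can be sketched naively as follows (an O(n²) scan rather than the single sorted pass described above; the (position, height) representation and the cutoff signature are invented for illustration, with a huge cutoff standing in for gaps that are too short, as in the text's use of a value like 10^{16}):

```python
def backward_eliminate(jumps, cutoff, chrom_end):
    """Repeatedly remove the jump that falls furthest below its
    gap-dependent cutoff until every retained jump satisfies it.

    jumps: sorted list of (position, height). cutoff(gap) gives the
    minimum |height| a jump needs to survive, where gap is the distance
    to the next retained jump (or to chrom_end for the last jump).
    """
    active = sorted(jumps)
    while active:
        def slack(k):
            pos, height = active[k]
            nxt = active[k + 1][0] if k + 1 < len(active) else chrom_end
            return abs(height) - cutoff(nxt - pos)
        worst = min(range(len(active)), key=slack)
        if slack(worst) >= 0:
            break  # every retained jump meets the criterion
        del active[worst]
    return active
```

Note how a jump sitting too close to its neighbor faces an effectively infinite cutoff and is eliminated first, after which the surviving neighbor is rescored against its new, wider gap.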

As described above, when two nearby jumps both fall below their cutoffs, we must choose which to remove first; the figure illustrates this choice for the jump pairs t_{1}, t_{2} and t_{3}, t_{4}.

As illustrated, we would remove the probe at t_{2} rather than the probe at t_{1}, and then the probe at t_{4} rather than the probe at t_{3}, even though

Although our algorithm is computationally efficient, calculating medians for large numbers of probes between CNVs slows the algorithm as the total number of probes increases.

When

At the termination of the algorithm just described (either with or without the use of blocks), the log-intensity ratios predicted by (3) form a step function in which each jump is ‘large enough’ compared with the gap between adjacent retained probes to satisfy our model selection criterion (7). However, because we have not required that the predicted log-intensity ratio return to zero between adjacent CNVs, it can occur that the predicted intensity between probes is actually less than µ_{min}

As the backward elimination step has terminated, each jump is larger than the appropriate cutoff. At the start of the cleanup step, the predicted intensities at t_{1} and t_{3} would be set to zero, decreasing

Finally, even after the cleanup step, some regions may have several jumps before returning to zero. Typically, this occurs for long CNVs. When the predicted LIR has the same sign over the entire region, we use the average intensity over the region, weighted by the number of probes. In those rare cases where a sign change occurs, we consider the probe at which the sign changes to be a boundary between two (adjacent) CNVs. Thus, if a region has first positive and then negative LIR values, we could consider that two CNVs are adjacent; if this occurs, we separately average the predicted intensities over any jumps occurring in regions where the LIR was positive and negative. Scores are then calculated using the length of the region and the averaged intensity.
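This final averaging step might be sketched as follows (the (n_probes, lir) segment tuples are illustrative; the pooling and sign-split rules follow the text):

```python
def summarize_region(segments):
    """Collapse a multi-jump region into one or two adjacent CNVs.

    Each segment is (n_probes, lir). Consecutive segments with the same
    LIR sign are pooled into a probe-weighted average; a sign change
    starts a new CNV. Returns a list of (total_probes, weighted_lir).
    """
    out = []
    for n, lir in segments:
        if out and out[-1][1] * lir > 0:
            pn, plir = out.pop()
            tot = pn + n
            # Weighted average of predicted LIRs, weighted by probes
            out.append((tot, (plir * pn + lir * n) / tot))
        else:
            out.append((n, lir))
    return out
```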

We first applied our algorithm to the Nimblegen data that we used previously to determine the score exponent

Validation | Detected by cnv.beast | |
Status: | Yes | No |
True | 38 | 6 |
False | 50 | 17 |

To assess the performance of our approach in an independent dataset, we analyzed Illumina 1M data from eight HapMap participants. Deletions among these individuals were determined experimentally by Kidd et al.

We first validated that our choice of CNV score was predictive of validation in these data by fitting the logistic regression model (1) to quantile-normalized data. We found good agreement between the exponent we obtained using Nimblegen X-chromosome data and the Illumina data; the estimated exponent was 0.59 with 95% confidence interval (0.34, 1.04). In

Triangles correspond to true positives, ellipses to false positives.

The cnv.beast algorithm performed well when compared with the six methods considered by Ely (see

Method | # of calls | Sensitivity | # of calls >6 kbp | FDR |
Circular Binary Segmentation | 315 | 0.218 | 104 | 0.788 |
Hidden Markov Model | 20,226 | 0.287 | 1,081 | 0.957 |
Segmentation/Cluster | 837 | 0.208 | 55 | 0.691 |
Wavelet-based Segmentation | 13,665 | 0.198 | 187 | 0.840 |
Fused Lasso | 655 | 0.248 | 130 | 0.808 |
Robust Smoothing | 37 | 0.059 | 29 | 0.690 |
cnv.beast | 195 | 0.299 | 167 | 0.826 |

Sensitivity is the number of true deletions that overlap at least partially with a called deletion, divided by the number of true deletions.

FDR is the number of called deletions that do not overlap even partially with a true deletion, divided by the number of called deletions.
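These two definitions translate directly into code (intervals as (start, end) pairs; a sketch, not the evaluation script used in the paper):

```python
def overlaps(a, b):
    """True if intervals a=(start, end) and b overlap at least partially."""
    return a[0] <= b[1] and b[0] <= a[1]

def sensitivity(true_cnvs, called):
    """Fraction of true deletions overlapping at least one called deletion."""
    hit = sum(any(overlaps(t, c) for c in called) for t in true_cnvs)
    return hit / len(true_cnvs)

def fdr(true_cnvs, called):
    """Fraction of called deletions overlapping no true deletion."""
    miss = sum(not any(overlaps(c, t) for t in true_cnvs) for c in called)
    return miss / len(called)
```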

Although the FDR of cnv.beast was 82.6% overall, it is possible to achieve lower FDRs by further filtering the list of called CNVs to those having score greater than some specified value. For example, the FDR for our method among calls made by cnv.beast using the default parameters in

CNVs are characterized by both height (intensity) and length (number of probes), making it difficult to predict which calls are valid. Using X-chromosome high-density Nimblegen array CGH data, we propose a univariate score that incorporates both intensity data and the number of probes in a call, and that predicts the probability that a CNV is valid. We then showed that the same score is a valid predictor of experimental validation in Illumina gene chip data.

Based on the concept of a univariate score, we then developed cnv.beast, a novel backward elimination regression algorithm that keeps terms corresponding to CNVs that exceed a user-defined threshold. Using data from eight HapMap participants, we showed that cnv.beast had superior performance when compared to the six other methods considered by Ely

Because our score correlates with the chance of validation, it is a useful quantity to calculate for CNVs called by any method. The plot of FDR as a function of score for the eight HapMap individuals shown in

As a univariate quantity, the score can facilitate Monte-Carlo significance testing. Specifically, if we can generate replicate datasets that are known to have no signal, then for each replicate dataset, we can record the largest score found among all CNVs detected. This distribution can then be used to assign a p-value to the CNVs called in the observed data.
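The Monte-Carlo step just described might be implemented as (a sketch; the +1 correction, a standard convention for empirical p-values, keeps the estimate away from zero):

```python
def monte_carlo_p(observed_max_score, null_max_scores):
    """Empirical p-value: fraction of no-signal replicate datasets whose
    largest CNV score meets or exceeds the largest score observed in
    the real data, with the standard +1 correction."""
    b = len(null_max_scores)
    exceed = sum(s >= observed_max_score for s in null_max_scores)
    return (exceed + 1) / (b + 1)
```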

Many algorithms for calling CNVs were initially developed for data from cancer cell lines, where copy number changes are often long and hence may be easier to detect. As

A Fortran program to unleash the power of the beast, as well as an R shell to run it and a PDF file with usage notes, is available at

We thank Li Hsu and Ben Ely for graciously giving us access to the HapMap data and early access to Ben Ely’s Master’s thesis.

Disclaimer: The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.