Highly mutable RNA viruses exist in infected hosts as heterogeneous populations of genetically close variants known as quasispecies. Next-generation sequencing (NGS) allows for analysing a large number of viral sequences from infected patients, presenting a novel opportunity for studying the structure of a viral population and understanding virus evolution, drug resistance and immune escape. Accurate reconstruction of genetic composition of intra-host viral populations involves assembling the NGS short reads into whole-genome sequences and estimating frequencies of individual viral variants. Although a few approaches were developed for this task, accurate reconstruction of quasispecies populations remains greatly unresolved.

Two new methods, AmpMCF and ShotMCF, for reconstruction of the whole-genome intra-host viral variants and estimation of their frequencies were developed, based on Multicommodity Flows (MCFs). AmpMCF was designed for NGS reads obtained from individual PCR amplicons and ShotMCF for NGS shotgun reads. While AmpMCF, based on covering formulation, identifies a minimal set of quasispecies explaining all observed reads, ShotMCS, based on packing formulation, engages the maximal number of reads to generate the most probable set of quasispecies. Both methods were evaluated on simulated data in comparison to Maximum Bandwidth and ViSpA, previously developed state-of-the-art algorithms for estimating quasispecies spectra from the NGS amplicon and shotgun reads, respectively. Both algorithms were accurate in estimation of quasispecies frequencies, especially from large datasets.

The problem of viral population reconstruction from amplicon or shotgun NGS reads was solved using the MCF formulation. The two methods, ShotMCF and AmpMCF, developed here afford accurate reconstruction of the structure of intra-host viral population from NGS reads. The implementations of the algorithms are available at

RNA-dependent RNA-polymerases of RNA viruses are error prone and frequently generate mutations, accumulation of which results in a diverse intra-host viral population of closely related genetic variants [

The advent of Next-Generation Sequencing (NGS) presented invaluable opportunity for the in-depth evaluation of viral quasispecies and understanding the structure of viral intra-host populations in unprecedented detail [

The problem of reconstruction of a structure of viral population formulated as

Most algorithms for the quasispecies spectrum reconstruction implicitly assume that sequence data were obtained using a shotgun experiment. Although the shotgun method is frequently used for reconstruction of long sequences and produces less distortion in frequency of quasispecies than the amplicon-based approach, the available NGS error correction algorithms are most efficient when applied to amplicon-based data [

In this paper, we consider two methods, AmpMCF and ShotMCF, for reconstruction of the genetic structure of intra-host viral population using either amplicon or shotgun NGS reads, respectively. Both methods are based on the application of MultiCommodity Flow problem (MCF) [

MCF is a classical optimization problem that searches for _{i}, t_{i}

Quasispecies reconstruction can be formulated as an optimization problem in two ways: (1) identification of the most probable set of quasispecies formed by the largest subset of reads from the data, referred to as packing formulation; and (2) identification of a minimal set of quasispecies explaining all observed reads, referred to as covering formulation. These two formulations, when applied to MCFs, were developed into path packing and path covering algorithms of ShotMCF and AmpMCF, respectively.

We consider an amplicon _{1}_{2 }_{1}) ≤ start(_{2}_{1}_{2}). A set of amplicons _{1}, ..., _{m}_{i }_{i+1 }overlap for _{1}⋃...⋃_{m }_{i}_{i+1 }and _{i,i+1 }= end(A_{i})- start(A_{i+1})+1, i.e., the suffix of length l_{i,i+1 }of r coincides with the prefix of length l_{i,i+1 }of r'.

Given an overlapping set _{1}, ..., _{m}_{1 }⋃ ... ⋃ _{m }_{i}_{v }_{i}_{u }_{v }_{1 }or _{m}_{1 }⋃ ... ⋃ _{m }_{i }_{i}_{i }

A bipartite clique of _{i}_{i+1 }for some _{i }_{i+1}.

_{i}, A_{i+1 }correspond to disjoint bipartite cliques in G

_{i }_{i+1}, such that _{v }_{u}_{v }_{u'}, _{v' }_{u}_{v' }_{u'}. Since _{v' }and _{u }_{v' }and _{u' }are not, the prefixes of length _{i,i+1 }of _{u }_{u' }must not be consistent. This implies a contradiction with _{v }_{u }_{v }_{u'}.

Using this simple finding, we transform the read graph _{p,q }of _{i}_{i+1 }for some _{fork}_{fork}_{i}_{fork}_{i+1}}. Define a new edge weight function _{fork}_{fork}v) = c(v), d(su) = d(vt)

As for _{p,q }

The quasispecies reconstruction problem may be restated as the following covering problem:

_{i }_{i }

We next reformulate Problem 1 as an MCF problem. Suppose that

s.t.

The variables ^{i }^{i }

This condition is guaranteed by the constraints (5). Constraints (2) and (3) are covering and flow conservation constraints, respectively. Constraints (4) guarantee that flows ^{i }_{i }_{i}

The frequency of

The input is a set of distinct reads _{v }_{1}, ..., _{k}

1) for each read _{v }

2) the directed edge (_{u }_{v }

3) for each candidate sequence _{i}_{i }_{i}_{i},v_{i}_{v }_{i}

Let a read _{v }_{i }^{i}_{v }_{v }_{i}

where

Using the read-graph constructed above, the quasispecies frequencies estimation problem can be formulated in terms of MCF as follows. Each (_{i}_{i}_{i }^{i}_{v}^{i}_{v }

s.t.

Here

In order to validate the devised methods, we used reads simulated from experimentally identified intra-host HCV variants or quasispecies.

The simulated reads were generated using individual 1734-nt sequences derived from the E1/E2 genomic region of intra-host HCV variants [

1) In the uniform distribution all sequences have approximately equal frequencies, which were calculated as normalized numbers of times each sequence was chosen in 1000 independent trials, where at each trial one of sequences was randomly chosen with an equal probability.

2) In the geometric distribution frequencies form a geometric progression. The frequencies were calculated by taking 10 first terms in geometric progressions and normalizing them.

3) In the skewed distribution one of the sequences has a high frequency, while the remaining sequences have uniformly low frequencies (generated as in 1).

The read lengths followed a normal distribution with mean value of 320nt and variance of 10nt. The number of reads in each simulated data set varied from 5K to 300K for ShotMCF and from 5K to 100K for AmpMCF. Shotgun reads were simulated using FlowSim [

For each size of a data set and for each distribution type 11 independent simulated data sets were generated, averages of measures of algorithms quality were calculated and the statistical significance of algorithms comparison was estimated using a Kruskal-Wallis test [

Problems formulations (1)-(7) and (11)-(14) were solved using the IBM ILOG CPLEX solver 12.2 (

P-values for a Kruskal-Wallis test were calculated using MATLAB (

The reconstructions obtained using ShotMCF and EM algorithms from ViSpA [_{1}, ..., _{n}_{1}, ..., _{n}

Figures

Statistical significance of the comparison of ShotMCF and EM

Geometric distribution | |||||||
---|---|---|---|---|---|---|---|

5000 | 20000 | 100000 | 150000 | 200000 | 250000 | 300000 | |

0.000071 | 0.000071 | 0.000071 | 0.002263 | 0.000122 | 0.000160 | 0.000093 | |

0.000071 | 0.000913 | 0.000071 | 0.038598 | 0.006428 | 0.016540 | 0.000071 | |

5000 | 20000 | 100000 | 150000 | 200000 | 250000 | 300000 | |

0.000071 | 0.000071 | 0.000071 | 0.000443 | 0.000566 | 0.001449 | 0.005258 | |

0.000071 | 0.000071 | 0.000122 | 0.000345 | 0.000566 | 0.001449 | 0.005258 | |

5000 | 20000 | 100000 | 150000 | 200000 | 250000 | 300000 | |

0.000071 | 0.000071 | 0.027823 | 0.000071 | 0.000720 | 0.000093 | 0.000071 | |

0.000071 | 0.000071 | 0.027823 | 0.000071 | 0.001152 | 0.001152 | 0.000071 |

ShotMCF statistically significantly outperforms EM on large data sets with geometric and skewed distributions, while the quality of EM is higher on small data sets. The quality of quasispecies reconstruction by EM, as implemented in ViSpA [

The accuracy of frequency estimation for variants with different abundances was analysed (Figure

The reconstructions obtained using AmpMCF (k = 12) and the Maximum Bandwidth (MB) algorithm proposed in [

1) RMSE

2) Jensen-Shannon divergence (JSD). It replaces KLD used for ShotMCF testing, since for AmpMCF and MB sizes of the reconstructed quasispecies populations may differ from the size of the correct population. JSD differs from KLD divergence due to the addition of a midpoint and is defined as follows:

where

3) Sensitivity S, which is defined as

4) Positive predicted value (PPV), which is defined as

Here, if CandQ is the set of quasispecies found by the algorithm and SimQ is the set of simulated quasispecies, then TruePositives = CandQ⋂SimQ, FalseNegatives = SimQ\CandQ and FalsePositives = CandQ\SimQ.

RMSE and JSD measure the quality of quasispecies frequencies estimation, and Sensitivity and PPV measure the quality of assembled quasispecies. Sensitivity is a measure of the positive identifications, which is defined as the percentage of correctly assembled quasispecies out of the population. PPV is a measure of the negative identification, which is defined as the percentage of correctly identified quasispecies over all assembled quasispecies.

Figures

Statistical significance of the comparison of AmpMCF and EM

Geometric distribution | |||
---|---|---|---|

5000 | 20000 | 100000 | |

0.001100 | 0.000069 | 0.000070 | |

0.200130 | 0.742240 | 0.000718 | |

0.46294 | 0.11743 | 0.84517 | |

0.66827 | 0.79078 | 0.037853 | |

5000 | 20000 | 100000 | |

0.122800 | 0.061063 | 0.015030 | |

0.339790 | 0.818120 | 0.742170 | |

0.34978 | 0.78918 | 0.89135 | |

0.13832 | 0.89501 | 0.50755 | |

5000 | 20000 | 100000 | |

0.469220 | 0.717980 | 0.224440 | |

0.211260 | 0.004284 | 0.023486 | |

- | 0.12341 | 0.39881 | |

0.20846 | 0.53018 | 0.40896 |

According to RMSE, AmpMCF statistically significantly outperforms Maximum Bandwidth for all sizes of data sets with the geometric distributions, and for large data sets with the uniform distribution. Although AmpMCF exceeded in accuracy Maximum Bandwidth on the 5K and 20K datasets with the uniform distribution, the difference in performance was statistically insignificant, with p-value being slightly greater than the statistical significance threshold of 5%. For the skewed distribution the results were comparable without statistically significant advantage of one algorithm over the other.

According to JSD and PPV, ShotMCF statistically significantly outperforms Maximum Bandwidth on the 100K data sets with the geometric distribution, while Maximum Bandwidth had the lower JSD values on the 20K and 100K data sets with the skewed distribution. For all other measures, sizes and distributions the results were comparable with no statistically significant advantage of one algorithm over the other. The p-value for S could not be calculated for the 5K data sets with the skewed distribution, since both algorithms were equally sensitive on all test examples.

So AmpMCF outperformed Maximum Bandwidth in quasispecies frequencies estimation for populations with geometric and uniform distributions, while both algorithms showed a similar performance in quasispecies sequence reconstruction.

The low sensitivity of AmpMCF and Maximum Bandwidth on the 5K data set with the skewed distribution is associated with the erroneous reconstruction of low-abundance variants by both algorithms, with only a dominant variant being correctly identified. For larger data sets, populations with the skewed distributions were reconstructed much more successfully and variants with frequencies as low as 0.8% were detected. It should be also noted that low-frequency variants were detected with higher probability in populations with the geometric distribution (Figure

In general, abundances of variants greatly affect their recoverability, with high-frequency variants being easier to detect (Figure

The accuracy of frequency estimation for detected variants with different abundances is illustrated by Figure

Two different network-flows based formulations applicable to quasispecies frequency reconstruction problem were developed. The first quasispecies spectrum reconstruction method based on network flows (NF) was proposed in [

AmpMCF extends the concept developed in [

ShotMCF extends the ViSpA algorithm described in [

The formulation for AmpMCF could not be applied to shotgun data since it assumes that each full-length sequence corresponds to a unique (s,t)-path in the read graph. However, this is not true for the shotgun data since certain sequences can be assembled from reads through different paths. This observation taken together with consideration of the structure of the read graph described by Lemma 1 indicates that the formulation is more suitable for amplicons. The analogue of AmpMCF for a shotgun data is the NF-based algorithm from [

Although the formulation of ShotMCF is applicable to amplicons, AmpMCF is more suitable for this task since ShotMCF handles only the second stage of quasispecies spectrum reconstruction problem, with the first stage being the candidate sequence generation adopted from ViSpA; while AmpMCF incorporates the whole problem into a single formulation.

The structure of the read graph explains a better match of the amplicon data to the covering rather than to packing formulation implemented by Maximum Bandwidth. According to Lemma 1, consistent overlaps between consecutive amplicons form bipartite cliques in a read graph. Edges within each bipartite clique are equal in respect to choosing (s,t)-paths in a read graph. This leads to a large number of peer alternatives for quasispecies assembling, indicating the need to search for the most parsimonious solution. The NF-based formulation with parsimony as an objective function and without predefined flow sizes requires covering constraints, and, therefore, leads to the covering formulation.

The advantage of ShotMCF method over EM-based method of ViSpA originates from enforcing uniformity of quasispecies coverage and using more accurate formula for the probability of emission of a given read from a given candidate sequence. The major advantage of the EM algorithm over the current version of ShotMCF is a greater speed and reduced requirements for computational resources such as computer memory and number of parallel processors. The reason is that MCF is implemented directly using linear programming formulation. It is expected that application of faster methods; e.g., based on lagrangian relaxations or Bender decomposition, should dramatically increase performance of ShotMCF.

It should be noted that MCF formulations assume absence of gaps in coverage. Although such gaps interrupt the assembly of the entire sequence, the genomic regions covered with reads can be identified using a reference sequence and quasispecies can be estimated with MCF-based algorithms for each region independently.

Two novel methods were developed for the reconstruction of the structure of viral population from the NGS shotgun and amplicon reads. Both methods are based on MCF and found suitable for the reliable assembly of viral quasispecies and estimation of their frequencies.

Authors declare that they have no competing interests.

PS developed the algorithms and wrote the manuscript. NM developed, implemented and tested AmpMCF algorithm. AA developed, implemented and tested ShotMCF algorithm. BT prepared the testing data. IM contributed to designing the algorithms and writing the manuscript. YK contributed to designing the algorithms and writing the manuscript. AZ developed the algorithms, wrote the manuscript and supervised the project. All authors read and approved the final manuscript.

PS and YK were supported intramurally by Centers for Disease Control and Prevention. NM, BT, IM and AZ were supported in part by Agriculture and Food Research Initiative Competitive Grant no. 201167016-30331 from the USDA National Institute of Food and Agriculture and by Life Technology Grant "Viral Metagenome Reconstruction Software for Ion Torrent PGM Sequencer". NM, AA, BT and AZ were supported in part by NSF award IIS-0916401. IM was supported in part by NSF award IIS-0916948. NM and BT were supported in part by Molecular Basis of Disease Fellowship, Georgia State University.

Authors thank referees for valuable comments which helped to significantly improve the paper.

Publication of this article was funded intramurally by Centers for Disease Control and Prevention.

This article has been published as part of