Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.

In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.

Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses.

The implementations of the algorithms and data sets used for their testing are available at:

Recent advances in the next-generation sequencing (NGS) methods allow for analyzing the unprecedented number of viral variants from infected patients and present a novel opportunity for understanding viral evolution, drug resistance and immune escape [

The main purpose of an error correction algorithm for viral amplicons is to discriminate between artifacts and actual sequences. This task becomes especially challenging when applied to recognizing and preserving low-frequency natural variants in viral population. Currently, the most used error correction algorithms involve the clustering of reads [

The aforementioned methods are optimized for shotgun analysis and assume that the errors introduced by sequencing are randomly distributed. However, a recent assessment of the accuracy and quality of amplicon reads obtained using 454-sequencing showed that the error rate is not randomly distributed; rather, it is strongly affected by the presence of homopolymers, position in the sequence, size of the sequence and spatial localization in PT plates [

In this paper, we present two new efficient error correction algorithms optimized for viral amplicons. The first algorithm (ET) includes a calibration step using sequence reads from single-clone samples. ET estimates an empirical frequency threshold for indels and haplotypes calculated from experimentally obtained clonal sequences, and also corrects homopolymer errors using sequence alignment. The second algorithm (KEC) does not need a calibration. It is based on the k-mer error correction. KEC optimizes the EDAR algorithm [

A set of 10 plasmid clones with different HCV HVR1 sequences was obtained. All the clones contained the modified HCV JFH1 sequence (GenBank accession number

A total of 24 samples of the plasmids were created, with 14 containing a single clone and 10 containing a mixture of clones in different concentrations (see Table

Relative concentrations of the 10 clones in each sample and the number of raw reads obtained in the 454 experiment.

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | N reads | |
---|---|---|---|---|---|---|---|---|---|---|---|

100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12096 | |

100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11134 | |

0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13183 | |

0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12080 | |

0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 15507 | |

0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 9643 | |

0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 16215 | |

0 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 10583 | |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 29101 | |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 24230 | |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 20560 | |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 19133 | |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 16542 | |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 23629 | |

2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 80.0 | 0 | 0 | 21168 | |

12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 0 | 0 | 17577 | |

10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 30.0 | 0 | 0 | 18482 | |

10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 20.0 | 20.0 | 0 | 0 | 18722 | |

1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 93.0 | 0 | 0 | 24519 | |

0.0 | 8.7 | 8.7 | 8.7 | 8.7 | 21.7 | 21.7 | 21.7 | 0 | 0 | 11561 | |

5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 20.0 | 50.0 | 0 | 0 | 20931 | |

3.1 | 3.1 | 3.1 | 3.1 | 3.1 | 0.0 | 42.3 | 42.3 | 0 | 0 | 18733 | |

80.0 | 2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 0 | 0 | 18007 | |

2.0 | 2.0 | 2.0 | 2.0 | 0.0 | 30.6 | 30.6 | 30.6 | 0 | 0 | 20677 |

The main purpose of the procedure is to calculate the frequency of erroneous haplotypes in amplicon samples where a single haplotype is expected. The calculation of an accurate threshold is dependent on high-quality pairwise sequence alignments and proper correction of homopolymers. The procedure was carried out with matlab [

(1) Amplicon size limits: All reads smaller than 90% of the expected amplicon length are deleted and all reads bigger than 110% of the expected amplicon length are clipped. All different haplotypes and their frequencies are calculated, which saves considerable time and memory at the following steps.

(2) Alignment to external references: Each haplotype is aligned against a set of external references of all known genotypes. For each haplotype the best match of the external set is chosen. The aligned sequence is clipped to the size of the chosen external reference.

(3) Alignment to internal references: The 20 most frequent haplotypes that do not create insertions or deletions in regard to the external reference are selected as the internal reference set. Each haplotype in the dataset is aligned against each member of internal references set. For each haplotype the best match of the internal set is chosen.

(3) Homopolymer correction: All homopolymers of 3 or more nucleotides are identified. If the homopolymer region includes an insertion, the nucleotide is removed. If the homopolymer includes a deletion, the gap is replaced by the missing nucleotide. Then all different haplotypes and their frequencies are recalculated.

(4) Outlier removal: All reads containing obvious PCR or sequencing artifacts are removed. Using the internal reference, the number of indels in each haplotype is found. An outlier threshold is defined for each file, calculated as the weighted average of the number of indels + 4 standard deviations. If a haplotype contains a number of indels higher than the outlier threshold, the haplotype is removed.

(5) Indel threshold: The read aligned to its reference is used to calculate the frequency of erroneous indels over all the 14 samples containing a single clone. An indel threshold is defined as the maximum frequency of erroneous indels (5.9%). If a haplotype contains an indel with a frequency lower than the indel threshold, the haplotype is removed.

(6) Haplotype error threshold: The frequency of erroneous haplotypes and its standard deviation is calculated over the 14 samples containing a single clone. A haplotype threshold was defined as the weighted average frequency of erroneous haplotypes + 9 standard deviations (0.40%). All haplotypes with a frequency lower than the haplotype threshold are removed.

(7) Removal of reads with Ns: All haplotypes with Ns are removed from the final file. This step was performed at the end rather than at the beginning to take advantage of the information that these reads provided regarding nucleotide frequencies at positions other than those with N.

The scheme of KEC includes 4 steps:

(1) Calculate k-mers s and their frequencies kc(s) (k-counts). We assume that k-mers with high k-counts ("solid" k-mers) are correct, while k-mers with low k-counts ("weak" k-mers) contain errors.

(2) Determine the threshold k-count (error threshold), which distinguishes solid k-mers from weak k-mers.

(3) Find error regions. The error region of the read is the segment [i, j] such that for every p ∈ [i, j] the k-mer starting at the position p is considered weak.

(4) Correct the errors in error regions.

Let r = (r_{1}, ..., r_{n}), r_{i}∈{A, T, G, C} be the fixed read. Denote by S_{k}(i) the k-mer of r starting at the position i and by KC_{k}(i) the k-count of this k-mer. For an arbitrary sequence s let pref_{j }(s) be the prefix of length j of s.

The unique reads r were stored together with their frequencies f_{r}. The straightforward calculation of k-mers and k-counts is inefficient due to the usually large size of the data set. We use hash map, where each key is a k-mer s and the corresponding value is the array v(s) = ((r, i): s = S_{k}(i) in the read r). The hash map can be rapidly constructed even for very large data sets.

The idea proposed in [_{er}. The length of the segment is the parameter of the algorithm.

The error regions in every read are calculated as follows. We first sequentially find isolated segments [i, j] such that for every p ∈ [i, j] KC_{k}(p) ≤ t_{er}. Then the k-mers of the read are clustered according to their k-counts using clustering by the variable bandwidth mean-shift method [_{k}(p) and S_{k}(q) belong to the same cluster. Overlapping segments are joined, and the obtained segments are error regions.

This stage consists of 3 steps:

(4a) Error correction in "short" error regions (with lengths not exceeding k).

(4b) Error correction in "long" error regions (with lengths greater than k).

(4c) Haplotypes reconstruction and postprocessing.

Steps (4a) and (4b) could be used for any sequencing data and could be considered as the separate algorithm. Stage (4c) is designed for amplicon data.

(4a) is based on the following principle: correct the error in the read r by finding the minimum-size set of insertions, deletions and replacements, which transform the r into read with all k-mers being solid. This problem could be precisely solved by the dynamic programming [

We call error region x = [b, e] of a read r = (r_{1}, ..., r_{n}) a tail, if either b = 1 or e = n-k+1. Let l(x) be the length of x, and h_{i}(w) denotes a sequence of i identical nucleotides w∈{A, T, G, C} (for i ≥ 2 h_{i}(w) is a homopolymer).

Let us introduce the following notation for different types of single-nucleotide errors: Rep denotes replacement; Ins_{1 }-- insertion, which does not create a homopolymer; Ins_{p}, p ≥ 2, -- insertion, which creates a homopolymer of length p; Del_{0 }-- deletion of an isolated nucleotide; Del_{m }-- deletion from the homopolymer of length m+1, m ≥ 1. The straightforward verification shows, that the following proposition holds:

_{e}

_{p}, 1 ≤ p ≤ k, then l(x) = k-p+1 and if p ≥ 2, the_{p-1}(w)

_{m}, 0 ≤ m ≤ k, then l(x) = k-m-1 and if m ≥ 1, then x is followed by h_{m}(c), where c≠w

According to Lemma 1, errors corresponding to non-tail error regions with lengths ≤ k could be identified and corrected. If the error region x = [b, e] of a read r = (r_{1}, ..., r_{n}) is a tail, then we delete from the read the suffix starting at the position b+k-1 (if e = n-k+1) or the prefix ending in the position e (if b = 0). This is the tail cutting operation. So, the basic scheme of the first stage of the error correction algorithm is the following.

1) Consider every non-tail error region x = [b, e] with length not exceeding k of every read r = (r_{1}, ..., r_{n}). We assume, that x was caused by single isolated error at the position e. Taking into account the length of x and the sequence of nucleotides following it, identify the type of error using Lemma 1. If l(x) < k-1, then the type of error and its correction can be determined unambiguously. In the case of insertion remove r_{e}; in the case of deletion duplicate r_{e}, if it will introduce a solid k-mer. According to Lemma 1 the cases l(x) = k and l(x) = k-1 contains ambiguities. If l(x) = k, then x could be caused either by nucleotide replacement with 3 possible corrections or by simple nucleotide insertion. If l(x) = k-1 and r_{e }= r_{e+1}, then x could be caused either by the insertion of the nucleotide r_{e }or by the deletion of the nucleotide c≠r_{e }between r_{e }and r_{e+1}. Consider all possible corrections of the error and choose the correction, which introduce solid k-mer with the highest k-count.

2) Cut tails, delete short reads, recalculate k-mers and error regions, delete reads covered for more than 40% with error regions. Storing k-mers in the hash map allows to perform both steps 1) and 2) very fast.

3) Repeat steps 1) and 2) until there are no error regions in the data set or the fixed number of iterations is reached.

Algorithm 1 is heuristic. It assumes that errors in two consecutive nucleotides are extremely unlikely. There are elements of greedy strategy at different stages of the algorithm. Nevertheless, the disadvantages of greedy algorithms are less detrimental in real data. For example, during the consideration of error regions with lengths k and finding the possible replacement of the nucleotide r_{e}, the existence of several different strong k-mers s with pref_{k-1}(s) = (r_{b}, ..., r_{e-1}) (*) is possible. To find the replacement of r_{e }Algorithm 1 will choose s with highest k-count, which is a greedy approach. However, the tests of Algorithm 1 on the selection of 24 different data sets with HCV sequencing data shows, that for the overwhelming majority of error regions of length k the strong k-mer s satisfying (*) is unique. The percentage of such error regions varies from 86 to 99.9% with average of 95.9%.

The error regions with lengths greater than k likely correspond to the situations when two or more errors are separated by less than k positions. This situation is significantly less probable than the presence of a single error. In Figure

However, a certain number of "long" error regions is inevitable. We can locate the possible error bases for the error region x = [b, e], len(x) > k - it is the positions b+k-1 and e. However, for such error regions we lose the opportunity for prediction of the error type by combining their length and nucleotide sequence following it, because the length of error regions corresponding to the individual errors could not be determined.

There two ways to treat "long" error regions. One is to discard all reads with errors uncorrected by Algorithm 1. In this case, the error threshold and error regions are recalculated after finishing Algorithm 1, and reads containing error regions are discarded. The other way is to correct errors in "long" error regions. All possible errors at positions b+k-1 and e are considered to choose the correction procedure causing the introduction of k-mer with the highest k-count. Since these corrections are less reliable than for "short" error regions, correction of "long" error regions is conducted at the end of the algorithm after correcting "short" error regions, and Algorithm 1 is applied again before generating the final output of corrected reads. Both approaches are implemented in KEC, allowing users to exercise their preferences.

In the error-free data set of amplicon reads the collection of unique reads should be identical to the set of haplotypes, and the frequencies of unique reads should be proportional to the concentrations of haplotypes. Errors result in the increasing number of unique reads and divergence between the frequencies and concentrations.

The steps (4a) and (4b) dramatically reduce the numbers of unique reads in the data set. The corresponding statistics is presented in Table

Number of reads in the datasets before and after steps (4a), (4b)

Before (4a), (4b) | 4220 | 4222 | 4418 | 4344 | 4426 | 3118 | 4661 | 4223 | 3986 | 4839 |

After (4a), (4b) | 306/8 | 502/18 | 385/8 | 483/9 | 179/2 | 390/14 | 409/8 | 367/8 | 394/11 | 418/12 |

As illustrated in Table

Let R = {r_{1}, ..., r_{n}} be a set of unique reads obtained after steps (4a),(4b), (f_{1}, ..., f_{n}) be the frequencies of these reads and R_{max }⊆ R be a set of maximal reads of R. Let a_{i, j }= 1, if the read r_{j }is a subsequence of the read r_{i}, and a_{i, j }= 0, otherwise, i, j = 1, ..., n (by the definition a_{i, i }= 1). Let also d_{j }be the number of reads, which contain r_{j }as the subsequence. The basic scheme of the error correction and haplotypes reconstruction algorithm is the following.

1) For every r_{i}∈R set its initial concentration

2) Calculate set R_{max}. For every r_{i}∈R_{max }recalculate its concentration by the following formula:

3) Calculate the multiple alignment of reads of R_{max}. Identify homopolymer regions in the alignment. For every position of each homopolymer region calculate the total concentrations of reads with and without gap at this position. If both of these concentrations are non-zero and one of them is α times greater than the other, correct the position accordingly.

4) Put R: = R_{max }and repeat 1) and 2). Calculate neighbor joining tree T of reads from R_{max }based on their pairwise alignment score. Let P be the set of pairs of reads having common parents in T. For every (r_{i}, r_{j}) ∈P consider pairwise alignment of r_{i }and r_{j}. Identify homopolymer regions as in 3). Correct the position of the homopolymer region if and only if either the nucleotide difference between r_{i }and r_{j }is 1 or their concentrations differ more than α times.

5) Put R: = R_{max }and repeat 1)-4) until no corrections can be made.

For the datasets described in this paper, α = 30 was used. ClustalW [

ET was implemented in Matlab and KEC in JAVA. Each sequence file was analyzed using ET, KEC and SHORAH error correction algorithms. SHORAH was applied several times using different parameters and the best attained results are reported here.

To evaluate performance of the three algorithms, nucleotide identity and frequency of the observed and expected true haplotypes were compared. Before doing this, the outputs of the three algorithms were post-processed to assure a fair comparison. The true haplotypes expected in each sample were aligned with the observed haplotypes using Needleman-Wunsch global alignment. The true haplotype with the best score was considered to be the match for each haplotype and the gapped ends were clipped in both sequences. For each method and sample, the following measures were calculated: (i) Missing true haplotypes, the number of expected haplotypes which were not found among the observed haplotypes; (ii) False haplotypes, the number of observed haplotypes with indels or nucleotide substitutions; (iii) Root mean square error, the disparity between the expected and observed frequencies of haplotypes; (iv) Average hamming distance, the distance between an observed false haplotype and its closest match, averaged over all false haplotypes.

The errors in reads of every single-clone sample were calculated by aligning each read with the corresponding clone sequence. The average percentage of error-free reads (true sequence) in the single-clone samples was 54.02% (Figure

A minimum spanning tree (Figure _{dist }of the dataset S6 illustrates the degree of sequence errors generated during 454-sequencing. This tree shows sequence relatedness of all unique haplotypes observed among the 454-reads of a single-clone sample. G_{dist }is a complete weighted graph, the vertices of G_{dist }are unique haplotypes, the weight of each edge h_{1}h_{2 }is the edit distance between h_{1 }and h_{2}. Most errors are found in low frequency haplotypes but homopolymer errors are usually found in high-frequency haplotypes.

Figure

Although the total number of nucleotide errors is high, they occur in different positions, making the frequency of individual reads with a particular error very low. The case with homopolymers errors is different, the longer a homopolymer is, the higher its fraction of errors (Figure

Table

Test results of the single-clone (S) and mixture (M) samples.

ET | KEC | SHORAH_all | SHORAH_1% | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

S1 | 0 | 0 | 0.00 | 0.00 | 0 | 2 | 1.64 | 2.50 | 0 | 351 | 29.02 | 4.83 | 0 | 3 | 19.19 | 2.00 |

S2 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 269 | 30.12 | 4.44 | 0 | 3 | 21.70 | 1.00 |

S3 | 0 | 1 | 1.04 | 1.00 | 0 | 0 | 0.00 | 0.00 | 0 | 292 | 23.44 | 5.31 | 0 | 2 | 14.94 | 1.00 |

S4 | 0 | 1 | 0.96 | 2.00 | 0 | 0 | 0.00 | 0.00 | 0 | 271 | 44.68 | 5.37 | 0 | 1 | 37.87 | 1.00 |

S5 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 319 | 9.63 | 4.47 | 0 | 0 | 0.00 | 0.00 |

S6 | 0 | 1 | 0.70 | 2.00 | 0 | 0 | 0.00 | 0.00 | 0 | 194 | 18.70 | 3.90 | 0 | 3 | 12.03 | 1.00 |

S7 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 496 | 21.52 | 6.70 | 0 | 4 | 9.13 | 8.50 |

S8 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 262 | 14.37 | 4.58 | 0 | 1 | 3.16 | 1.00 |

S9 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 183 | 6.23 | 6.97 | 0 | 0 | 0.00 | 0.00 |

S10 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 288 | 7.77 | 5.11 | 0 | 0 | 0.00 | 0.00 |

S11 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 717 | 24.71 | 5.03 | 0 | 3 | 9.41 | 1.00 |

S12 | 0 | 1 | 0.65 | 2.00 | 0 | 0 | 0.00 | 0.00 | 0 | 611 | 25.94 | 5.52 | 0 | 4 | 10.61 | 1.50 |

S13 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 156 | 5.53 | 4.93 | 0 | 0 | 0.00 | 0.00 |

S14 | 0 | 0 | 0.00 | 0.00 | 0 | 0 | 0.00 | 0.00 | 0 | 161 | 6.83 | 6.60 | 0 | 0 | 0.00 | 0.00 |

Mean | 0.00 | 0.29 | 0.24 | 0.50 | 0.00 | 0.14 | 0.12 | 0.18 | 0.00 | 326.43 | 19.18 | 5.27 | 0.00 | 1.71 | 9.86 | 1.29 |

M1 | 0 | 0 | 1.26 | 0.00 | 0 | 0 | 0.78 | 0.00 | 0 | 320 | 1.23 | 4.51 | 0 | 0 | 0.76 | 0.00 |

M2 | 0 | 0 | 1.50 | 0.00 | 0 | 0 | 1.95 | 0.00 | 0 | 738 | 3.70 | 4.44 | 0 | 2 | 3.47 | 1.00 |

M3 | 0 | 0 | 2.87 | 0.00 | 0 | 0 | 4.22 | 0.00 | 0 | 638 | 3.65 | 4.25 | 0 | 1 | 4.42 | 1.00 |

M4 | 0 | 0 | 2.12 | 0.00 | 0 | 0 | 3.09 | 0.00 | 0 | 577 | 2.88 | 5.20 | 0 | 1 | 3.08 | 1.00 |

M5 | 0 | 0 | 0.29 | 0.00 | 7 | 0 | 7.00 | 0.00 | 0 | 214 | 0.91 | 7.37 | 7 | 0 | 7.00 | 0.00 |

M6 | 0 | 0 | 2.45 | 0.00 | 0 | 0 | 1.81 | 0.00 | 0 | 394 | 3.25 | 4.50 | 0 | 0 | 2.57 | 0.00 |

M7 | 0 | 0 | 1.04 | 0.00 | 0 | 0 | 2.42 | 0.00 | 0 | 499 | 2.04 | 5.00 | 0 | 0 | 2.95 | 0.00 |

M8 | 0 | 0 | 0.40 | 0.00 | 0 | 0 | 2.25 | 0.00 | 0 | 336 | 3.30 | 5.48 | 0 | 1 | 3.22 | 1.00 |

M9 | 0 | 0 | 2.40 | 0.00 | 0 | 0 | 1.53 | 0.00 | 0 | 643 | 6.56 | 4.49 | 1 | 4 | 4.02 | 1.00 |

M10 | 0 | 0 | 3.70 | 0.00 | 0 | 0 | 4.16 | 0.00 | 1 | 637 | 6.13 | 5.30 | 1 | 2 | 5.73 | 1.50 |

Meann | 0.00 | 0.00 | 1.80 | 0.00 | 0.70 | 0.00 | 2.92 | 0.00 | 0.10 | 499.60 | 3.37 | 5.05 | 0.90 | 1.10 | 3.72 | 0.65 |

MT: Missing true haplotypes; FS: False haplotypes; RMSE: root mean square error; HD: Average Hamming distance, averaged over all false haplotypes. SHORAH_all shows the raw results and SHORAH_1% the results based only on those haplotypes with a frequency > 1%.

All methods found the correct sequence in each single-clone sample (Figure

The low Root mean square error (RMSE) between observed and expected frequencies of true haplotypes indicates that ET and KEC have a greater accuracy than SHORAH in single-clone samples. All three algorithms show equivalent results in the mixture samples (Figure

Analysis of the Hamming distance between false haplotypes and their closest match shows that false haplotypes retained by KEC and ET are genetically closer to true haplotypes than the ones retained by SHORAH (Figure

The test results of all samples are summarized in Table

Hepatitis C virus (HCV) is a single-stranded RNA virus belonging to the

In addition, current error-correction algorithms report performance measures related to their ability of finding true sequences, rather than the number of false haplotypes [

The highly non-random nature of 454-sequencing errors calls for internal controls tailored to the amplicon of interest. The error distribution of single-clone samples helped us to calibrate the ET algorithm, thus facilitating its high accuracy in detection of true sequences in the HVR1 amplicons. ET was successful in finding the correct set of haplotypes in all 10 mixtures and in 10 of 14 single-clone samples, while found one false haplotype in 4 single-clone samples. KEC was correct for 13 of 14 single-clone samples (with 2 false haplotypes for one sample) and for 9 of 10 mixtures (not being able to find low-frequency clones in the mixture M5), having also the advantage that it does not need an experimental calibration step. SHORAH found all correct haplotypes for all single-clone samples and for 9 of 10 mixtures, having a very large number of false haplotypes and a significant divergence of expected and found frequencies. Introduction of a frequency cutoff for SHORAH results in loss of true haplotypes. SHORAH with frequency cutoff 1% was correct for 5 single-clone samples and for 3 mixtures, having both missing true and false haplotypes for other samples.

We highly encourage the sequencing of single-clone samples of the desired amplicon in order to understand the nature and distribution of the errors and to measure the performance of the algorithm in this particular amplicon.

Most algorithms are successful in identifying and removing low-frequency errors. However, reads with high-frequency homopolymer errors should not be removed but rather corrected, allowing for preservation of valuable data. All three algorithms correct reads with homopolymers in a different way. KEC uses the k-mer distribution to discern between erroneous and correct k-mers and then fixes homopolymers using a heuristic algorithm. ET fixes the homopolymers based on pairwise alignments with high-quality internal haplotypes. SHORAH clusters reads with a similar sequence, effectively creating a consensus haplotype. Sample S4 is particularly interesting because it included a false haplotype with a raw frequency of 25.85%. This false haplotype contained a deletion in a long homopolymer (n = 7). Both KEC and ET fixed this haplotype, but the clustering algorithm SHORAH preserved this false haplotype because of its high frequency and made it a center for the cluster of other reads, achieving a final frequency of 33.25%. The same situation occurs in most single-clone samples: in samples S1, S2, S3, S4, S6, S8, S11, S12 the second-frequent haplotypes with frequencies 13.3%, 16.2%, 11.6%, 33.25%, 5.2%, 2.79%, 2.9%, 3.89%, respectively differs from most frequent haplotype by one indel in a long homopolymer. The main assumption of clustering algorithms is that the observed set of reads represents a statistical sample of the underlying population and that probabilistic models can be used to assign observed reads to unobserved haplotypes in the presence of sequencing errors [

SHORAH, ET and KEC are equally accurate in finding true haplotypes. However, new algorithms, KEC and ET, are more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones. Both algorithms are highly suitable for rapid recovery of high quality haplotypes from reads obtained by NGS of amplicons from heterogeneous viruses such as HCV and HIV.

The authors declare that they have no competing interests.

PS developed and implemented the KEC algorithm. ZD and DSC developed the ET algorithm. ZD implemented the ET algorithm. YK and GV designed the experimental section. LR and JY created the plasmid clones. GV and JCF prepared the samples for sequencing. AZ helped to run the SHORAH software. PS, ZD and DSC analyzed all data. DSC, PS and YK wrote the manuscript. All authors read and approved the final manuscript.

This article has been published as part of

We thank the Biotechnology Core Facility (Centers for Disease Control and Prevention) for successfully running our samples on the 454/Roche GS FLX instrument.