Efficient error correction for next-generation sequencing of viral amplicons

Skums, Pavel; Dimitrova, Zoya; Campo, David S; Vaughan, Gilberto; Rossi, Livia; Forbi, Joseph C; Yokosawa, Jonny; Zelikovsky, Alex; Khudyakov, Yury

Supporting Files Public Domain

Jun 25 2012

Details

Alternative Title:

BMC Bioinformatics
Personal Author:

Skums, Pavel ; Dimitrova, Zoya ; Campo, David S ; Vaughan, Gilberto ; Rossi, Livia ; Forbi, Joseph C ; Yokosawa, Jonny ; Zelikovsky, Alex ; Khudyakov, Yury
Description:

Background

Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that errors are randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.

Results

In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.
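The abstract names the two approaches without describing their internals. As a rough illustration only, and not the authors' KEC or ET implementation, the general ideas behind k-mer counting and frequency thresholding might be sketched as follows. All function names, the k-mer length, and both cutoffs are hypothetical choices made for this example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmer_positions(read, counts, k, min_count):
    """Return start positions in `read` whose k-mer occurs fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


def filter_by_frequency(reads, min_fraction):
    """Keep distinct reads whose share of all reads is at least min_fraction."""
    total = len(reads)
    hap_counts = Counter(reads)
    return {hap: n / total for hap, n in hap_counts.items()
            if n / total >= min_fraction}


# Toy data (invented for illustration): one dominant variant and two rare ones.
reads = ["ACGTACGT"] * 8 + ["ACGTTCGT"] * 1 + ["ACGAACGT"] * 1

counts = kmer_counts(reads, k=4)

# Positions in a rare read covered by k-mers seen fewer than twic in the data set;
# such regions are candidates for sequencing errors under this simplified view.
suspect = rare_kmer_positions("ACGTTCGT", counts, k=4, min_count=2)

# Distinct reads that make up at least 20% of the data set (assumed cutoff).
kept = filter_by_frequency(reads, min_fraction=0.2)
```

In this toy data set the rare k-mers cluster around the mismatching base, and only the dominant variant passes the frequency cutoff. A real pipeline would calibrate both cutoffs against the sequence-specific error profile that the Background section describes.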

Conclusions

Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses.
Subjects:

Algorithms ; Cluster Analysis ; Computational Biology ; DNA, Viral ; Haplotypes ; Proceedings ; Sequence Analysis, DNA ; Viruses
Source:

BMC Bioinformatics. 2012; 13(Suppl 10):S6.
Document Type:

Journal Article
Volume:

13
Collection(s):

CDC Public Access
Main Document Checksum:

urn:sha256:53c8dfdc09e620bf8746f8be07ea2b0e3e19630c2bef06b559478eb67ebb1ff8
Download URL:

https://stacks.cdc.gov/view/cdc/10839/cdc_10839_DS1.pdf
File Type:

[PDF - 981.99 KB]

Supporting Files:

1471-2105-13-S10-S6-1 through 1471-2105-13-S10-S6-9 (each as .gif and .jpg) ; 1471-2105-13-S10-S6.nxml ; license.txt

