U50: A New Metric for Measuring Assembly Output Based on Non-Overlapping, Target-Specific Contigs

Castro, Christina J.; Ng, Terry Fei Fan

Apr 18 2017
By Castro, Christina J. ; Ng, Terry Fei Fan

File Language:

English

Details

Alternative Title:

J Comput Biol
Personal Author:

Castro, Christina J. ; Ng, Terry Fei Fan
Description:

Advances in next-generation sequencing technologies enable routine genome sequencing, generating millions of short reads. A crucial step in full-genome analysis is de novo assembly, and the performance of different assembly methods is currently measured by a metric called N50. However, the N50 value can produce skewed, inaccurate results when complex data are analyzed, especially for viral and microbial datasets. To provide a better assessment of assembly output, we developed a new metric called U50. U50 identifies unique, target-specific contigs by using a reference genome as a baseline, aiming to circumvent some limitations inherent to the N50 metric. Specifically, the U50 program removes overlapping sequence among multiple contigs by using a mask array, so that assembly performance is measured only by unique contigs. We compared simulated and real datasets using U50 and N50, and our results demonstrated that U50 has the following advantages over N50: (1) reducing erroneously large N50 values due to a poor assembly, (2) eliminating overinflated N50 values caused by large measurements from overlapping contigs, (3) eliminating diminished N50 values caused by an abundance of small contigs, and (4) allowing comparisons across different platforms or samples based on the new percentage-based metric UG50%. The U50 metric allows a more accurate measure of assembly performance by analyzing only the unique, non-overlapping contigs. In addition, most viral and microbial sequencing datasets have high background noise (i.e., host and other non-target sequences), which contributes to a skewed, misrepresented N50 value; U50 corrects for this. Also, UG50% can be used to compare assembly results from different samples or studies, a cross-comparison that cannot be performed with N50.
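
The mask-array idea described above can be illustrated with a minimal Python sketch. This is not the authors' U50 program. It assumes each target-specific contig has already been aligned to the reference and is given as a half-open (start, end) interval in 0-based reference coordinates, and that contigs are processed from longest to shortest. The function names (n50, unique_lengths, u50, ug50, ug50_percent) are hypothetical, and the UG50% formula used here (UG50 as a percentage of the reference length, by analogy with NG50) is an assumption, not something stated in the abstract.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on the reference, 0-based


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs


def unique_lengths(alignments: List[Interval], ref_len: int) -> List[int]:
    """Count, per contig, the reference positions not already covered.

    Assumption: longer contigs claim positions first; contigs that add
    no new positions are dropped.
    """
    mask = [False] * ref_len  # one flag per reference position
    result = []
    by_length = sorted(alignments, key=lambda se: se[1] - se[0], reverse=True)
    for start, end in by_length:
        new_bases = 0
        for pos in range(max(start, 0), min(end, ref_len)):
            if not mask[pos]:
                mask[pos] = True
                new_bases += 1
        if new_bases > 0:
            result.append(new_bases)
    return result


def u50(alignments: List[Interval], ref_len: int) -> int:
    """N50-style statistic computed over unique (masked) lengths only."""
    return n50(unique_lengths(alignments, ref_len))


def ug50(alignments: List[Interval], ref_len: int) -> int:
    """Like u50, but the 50% threshold is half the reference length (assumed)."""
    running = 0
    for length in sorted(unique_lengths(alignments, ref_len), reverse=True):
        running += length
        if 2 * running >= ref_len:
            return length
    return 0  # unique contigs cover less than half of the reference


def ug50_percent(alignments: List[Interval], ref_len: int) -> float:
    """UG50 expressed as a percentage of the reference length (assumed formula)."""
    return 100.0 * ug50(alignments, ref_len) / ref_len


if __name__ == "__main__":
    hits = [(0, 100), (50, 120), (10, 40)]  # made-up alignments
    print(unique_lengths(hits, 200), u50(hits, 200), ug50_percent(hits, 200))
```

In this sketch the second contig contributes only the 20 positions the first did not cover, and the third contributes none, so overlapping sequence is counted once. Because the percentage is normalized by reference length, values from references of different sizes fall on the same scale.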
Subjects:

Algorithms; Article; Contig Mapping; Genome, Bacterial; Genome, Viral; Genome Assembly; Genomics; High-Throughput Nucleotide Sequencing; Humans; N50; Next-generation Sequencing; Sequence Analysis, DNA; Software; U50
Source:

J Comput Biol. 24(11):1071-1080
Pubmed ID:

28418726
Pubmed Central ID:

PMC5783553
Document Type:

Journal Article
Funding:

CC999999/Intramural CDC HHS/United States
Volume:

24
Issue:

11
Collection(s):

CDC Public Access
Main Document Checksum:

urn:sha256:0a51bb7a3430bf85108086a6793f91fac1de6dcc30959469297d82b0d4ca5ff0
Download URL:

https://stacks.cdc.gov/view/cdc/51161/cdc_51161_DS1.pdf
File Type:

[PDF - 652.84 KB]
