Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

Timme, Ruth E.; Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.

doi:10.7717/peerj.3893

Supporting Files Public Domain

Oct 06 2017

File Language:

English

Details

Alternative Title:

PeerJ
Personal Author:

Timme, Ruth E. ; Rand, Hugh ; Shumway, Martin ; Trees, Eija K. ; Simmons, Mustafa ; Agarwala, Richa ; Davis, Steven ; Tillman, Glenn E. ; Defibaugh-Chavez, Stephanie ; Carleton, Heather A. ; Klimke, William A. ; Katz, Lee S.
Description:

Background

As next-generation sequencing technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationships in public health. Most new programs skip the traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing the results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines.
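To make the idea of summarizing variants in a SNP matrix concrete, the following is a minimal sketch and not any specific pipeline's method. It assumes the genomes have already been aligned to a common coordinate system; the isolate names, the toy sequences, and the helper functions `snp_columns`, `snp_matrix`, and `pairwise_snp_distances` are hypothetical and exist only for illustration.

```python
from itertools import combinations

# Characters treated as missing data (assumption for this sketch).
MISSING = {"N", "-"}


def snp_columns(alignment):
    """Return indices of columns where at least two genomes carry different bases."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    return [
        i for i in range(length)
        if len({s[i] for s in seqs} - MISSING) > 1
    ]


def snp_matrix(alignment):
    """Keep only the variable columns, giving one short SNP string per genome."""
    cols = snp_columns(alignment)
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def pairwise_snp_distances(matrix):
    """Count differing sites for every pair of genomes, skipping missing data."""
    distances = {}
    for a, b in combinations(sorted(matrix), 2):
        distances[(a, b)] = sum(
            1
            for x, y in zip(matrix[a], matrix[b])
            if x != y and x not in MISSING and y not in MISSING
        )
    return distances


# Toy aligned sequences (hypothetical data).
alignment = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAC",
    "isolate_C": "ACTTACGAAC",
    "isolate_D": "ACTTACGANC",
}

matrix = snp_matrix(alignment)
distances = pairwise_snp_distances(matrix)
```

A matrix like this is the kind of input that a standard phylogenetic analysis would then turn into a tree. Real pipelines add steps this sketch omits, such as read mapping, variant calling, and quality filtering.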

Methods

We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly accessible databases and developed a standard spreadsheet format for describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that reads this spreadsheet format.
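As an illustration of how a script might read a descriptive spreadsheet of this kind, the sketch below parses a tab-delimited file with a key-value header followed by a sample table. The layout, the field names (`Organism`, `Tree`, `sample_id`, `run_accession`, `in_outbreak`), and the function `parse_dataset_description` are assumptions made for this example. They are not the published format, which is documented in the authors' GitHub repository.

```python
import csv

# Hypothetical dataset description: key-value header, blank line, sample table.
EXAMPLE_TSV = (
    "Organism\tExample organism\n"
    "Tree\thttp://example.org/tree.nwk\n"
    "\n"
    "sample_id\trun_accession\tin_outbreak\n"
    "sample1\tRUN0001\tyes\n"
    "sample2\tRUN0002\tno\n"
)


def parse_dataset_description(text):
    """Split a description into a header dict and a list of per-sample dicts."""
    lines = text.splitlines()
    try:
        split_at = lines.index("")  # the first blank line separates the two parts
    except ValueError:
        raise ValueError("expected a blank line between header and sample table") from None

    # Header part: each row is a two-column key/value pair.
    header = {key: value for key, value in csv.reader(lines[:split_at], delimiter="\t")}

    # Table part: the first row holds the column names.
    samples = list(csv.DictReader(lines[split_at + 1:], delimiter="\t"))
    return header, samples


header, samples = parse_dataset_description(EXAMPLE_TSV)
outbreak_runs = [s["run_accession"] for s in samples if s["in_outbreak"] == "yes"]
```

A download step would then loop over the run accessions and fetch each one from the relevant public database. That step is left out here because it depends on external tools and network access.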

Results

Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.

Discussion

These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools. We welcome additional benchmark datasets in our recommended format and, where relevant, will add them to our GitHub site. Together, these datasets, the dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
Subjects:

Benchmark Datasets ; Bioinformatics ; E. Coli ; Epidemiology ; Evolutionary Studies ; Foodborne Outbreak ; Food Safety ; Food Science And Technology ; Genomics ; Listeria ; Phylogenomics ; Salmonella ; Validation ; WGS
Source:

PeerJ. 2017; 5
Pubmed ID:

29372115
Pubmed Central ID:

PMC5782805
Document Type:

Journal Article
Volume:

5
Collection(s):

CDC Public Access
Main Document Checksum:

urn:sha256:ec913bcf7210fb2157a9c9eb5b57b442093251dd9edbe9773e5409ed0c1631f1
Download URL:

https://stacks.cdc.gov/view/cdc/61280/cdc_61280_DS1.pdf
File Type:

[PDF - 1021.52 KB ]
