Investigative Genetics (Investig Genet), ISSN 2041-2223, BioMed Central

PMID: 25101166; PMCID: PMC4123828; DOI: 10.1186/2041-2223-5-9

Review

Validation of high throughput sequencing and microbial forensics applications

Bruce Budowle^1,2, bruce.budowle@unthsc.edu
Nancy D Connell^3, nnconnell@me.com
Anna Bielecka-Oder^4, anna.bielecka@poczta.fm
Rita R Colwell^5,6,7,8, rcolwell@umiacs.umd.edu
Cindi R Corbett^9,10, Cindi.Corbett@phac-aspc.gc.ca
Jacqueline Fletcher^11, jacqueline.fletcher@okstate.edu
Mats Forsman^12, mats.forsman@foi.se
Dana R Kadavy^13, DKadavy@signaturescience.com
Alemka Markotic^14, alemka.markotic@gmail.com
Stephen A Morse^15, sam1@cdc.gov
Randall S Murch^16, rmurch@vt.edu
Antti Sajantila^1,17, antti.sajantila@helsinki.fi
Sarah E Schmedes^1, sarah.schmedes@unthsc.edu
Krista L Ternus^13, kternus@signaturescience.com
Stephen D Turner^18, sdt5z@virginia.edu
Samuel Minot^13, sminot@signaturescience.com

^1 Department of Molecular and Medical Genetics, Institute of Applied Genetics, University of North Texas Health Science Center, Fort Worth, Texas, USA
^2 Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
^3 Rutgers New Jersey Medical School, Center for Biodefense, Rutgers University, Newark, New Jersey, USA
^4 Department of Epidemiology, The General K. Kaczkowski Military Institute of Hygiene and Epidemiology, Warsaw, Poland
^5 CosmosID®, 387 Technology Dr, College Park, MD, USA
^6 Maryland Pathogen Research Institute, University of Maryland, College Park, MD, USA
^7 University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA
^8 Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
^9 Bioforensics Assay Development and Diagnostics Section, Science Technology and Core Services Division, National Microbiology Laboratory, Winnipeg, MB, Canada
^10 Department of Medical Microbiology, University of Manitoba, Winnipeg, Canada
^11 National Institute for Microbial Forensics & Food and Agricultural Biosecurity, Oklahoma State University, Stillwater, OK, USA
^12 Division of CBRN Defence and Security, Swedish Defence Research Agency, Umeå, Sweden
^13 Signature Science, LLC, Austin, TX, USA
^14 University Hospital for Infectious Diseases “Fran Mihaljevic” and Medical School University of Rijeka, Zagreb, Croatia
^15 Division of Foodborne, Waterborne, and Environmental Diseases, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, Georgia
^16 Virginia Tech, National Capital Region, Arlington, VA, USA
^17 Department of Forensic Medicine, Hjelt Institute, University of Helsinki, Helsinki, Finland
^18 Public Health Sciences, Bioinformatics Core Director, University of Virginia School of Medicine, Charlottesville, VA, USA

Investigative Genetics 2014, 5:9

Received: 8 May 2014; Accepted: 9 July 2014; Published: 30 July 2014

© 2014 Budowle et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

High throughput sequencing (HTS) generates large amounts of high quality sequence data for microbial genomics. The value of HTS for microbial forensics is the speed at which evidence can be collected and the power to characterize microbial-related evidence to solve biocrimes and bioterrorist events. As HTS technologies continue to improve, they provide increasingly powerful sets of tools to support the entire field of microbial forensics. Accurate, credible results allow analysis and interpretation, significantly influencing the course and/or focus of an investigation, and can impact the response of the government to an attack having individual, political, economic or military consequences. Interpretation of the results of microbial forensic analyses relies on understanding the performance and limitations of HTS methods, including analytical processes, assays and data interpretation. The utility of HTS must be defined carefully within established operating conditions and tolerances. Validation is essential in the development and implementation of microbial forensics methods used for formulating investigative leads and attribution. HTS strategies vary, requiring guiding principles for HTS system validation. Three initial aspects of HTS, irrespective of chemistry, instrumentation or software, are: 1) sample preparation, 2) sequencing, and 3) data analysis. Criteria that should be considered for HTS validation for microbial forensics are presented here. Validation should be defined in terms of the specific application, and the criteria described here comprise a foundation for investigators to establish, validate and implement HTS as a tool in microbial forensics, enhancing public safety and national security.

Keywords: Microbial forensics, Validation, High throughput sequencing, Sample preparation, Library preparation, Bioinformatics

Background

Microbial forensics involves analysis of microbe-related materials found at a crime scene, suspected laboratory, and so on, for forensic attribution and, thus, can be pivotal for developing investigative leads. Attribution (assigning to a source) can be defined as the characterization of a sample with the greatest specificity, which in the case of a microorganism would be at the species or strain level and ideally at the level of the isolate or even the culture vessel (for example, flask) from which the sample originated. High throughput sequencing (HTS) vastly improves the possibility that the forensic and scientific communities will be able to assign features to bio-forensic evidence, such as specific identity for unknown or emerging pathogens, sample or microbe origin, antibiotic sensitivity, evidence of genetic engineering and virulence profile. Now that a number of laboratories can afford HTS systems, community-accepted validation guidelines or standards are needed. As with any analytical tool for forensic application, the utility of HTS, its operating conditions and tolerances, and its interpretation guidelines must be carefully defined. Guiding principles must be established to validate HTS systems. Here we define the criteria and offer a process for validation of HTS systems in microbial forensics. If methods are validated within the framework outlined here, microbial forensics will achieve an ever higher level of power and analytical value and, ultimately, greater protection for the public and the nation’s safety and security.

Introduction

More than a decade ago the United States experienced a simple but effective biological attack in which Bacillus anthracis endospores were placed in envelopes and delivered by the US postal service to intended victims [1-4]. The Federal Bureau of Investigation initiated the Hazardous Material Response Unit in 1996 to undertake forensic investigations of bioterrorism events. Despite this effort, in 2001 the forensic infrastructure was inadequately prepared to analyze and interpret the available microbiological evidence to assist in determining who did and did not have the capacity to perpetrate such an attack. In fact, many of the needed forensic science applications had not yet been developed or validated. As part of an immediate national response to investigate such crimes, the field of microbial forensics was born [5-7] and its emergence was accelerated by the urgent requirement to investigate the anthrax mailing attacks.

The foundations of the field of microbial forensics lie in public health epidemiology and its practices [6-10] as well as agriculture practices [11-13]. Microbial forensics involves analysis of microbe-related materials found at a crime scene, suspected laboratory, and so on for forensic attribution (assigning to a source) and, thus, can be pivotal for developing investigative leads. Attribution in the case of microbial forensics can be further defined as the characterization of microorganisms within a sample to the species or strain level and ideally to the specific isolate or culture vessel from which the sample originated. Indeed, metagenomic approaches to assess microbial composition of samples also may provide strong microbial forensics evidence (either phylogenetically by identifying a specific target organism in the complex sample or by abundance spectrum profile) to attribute sample(s) to source. Scientific attribution also eliminates as many other candidate isolates or sources as possible and supports both investigation and legal proceedings.

The standards and requirements for microbial forensic practices are less well defined than those within human identification and other established forensic disciplines. However, establishing the validity of microbial forensic methods and their use and interpretation contributes to acceptance, admissibility, confidence, value and weight of physical evidence in the jurisprudence process [14] as well as within the military, intelligence and homeland security sectors that have the responsibility to act upon data and reports associated with suspected bioterror activities. Within two years following the anthrax letter attacks, the FBI’s Scientific Working Group for Microbial Genetics and Forensics (SWGMGF) formalized and published Quality Assurance (QA) guidelines [7]. The motivation for establishing a QA system was to put quality practices in place to ensure that microbial forensic evidence was analyzed using the best practices possible and that the interpretation of results was based on extant data and sound principles.

The SWGMGF QA guidelines were a good first step in establishing a QA system for microbial forensics and for increasing confidence in the data generated. However, as technologies advance and application of microbial forensics expands beyond the traditional law enforcement communities, it becomes increasingly important to continue to build upon the SWGMGF guidance and previously published microbial validation methods [7] to reflect the current state of the practice and foster greater community-wide acceptance. Significant drivers to expand validation guidance include the substantial developments and applications of next-generation sequencing, or HTS. For perspective, the first bacterial genomes that were sequenced in 1995 by The Institute for Genomic Research (Rockville, MD, USA) [15,16] took more than three months to complete. Although HTS technology was initially developed, in part, for characterizing human genomes [17-19], these instruments have increasingly been used successfully to characterize unknown microbes in samples of varying complexity [20-42]. Within the field of microbial forensics [7,43,44], HTS combined with robust bioinformatics capabilities offers a powerful tool to characterize forensic bio-evidence, including unknown microorganisms, genetically-engineered microorganisms and low-abundance (or trace) microorganisms present in complex mixed samples with extremely high sensitivity [45]. HTS technologies have features that make them more desirable and accessible for microbial forensic analyses than Sanger sequencing [46], including high throughput, reduced cost (on a per nucleotide or per genome basis) [47] and large-scale automation capability. Millions of sequencing reactions can be performed in a massively parallel fashion in a single instrument run [48-53]. With many copies sequenced at any desired portion of the genome (known as coverage), consensus sequence accuracy can be increased far beyond the per-read accuracy rate. As the throughput and accuracy of HTS continue to increase, more samples can be multiplexed in a single run without sacrificing depth of coverage, or more complex samples may be analyzed at a greater depth of coverage.
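The effect of coverage on consensus accuracy can be illustrated with a simple model. The following is a minimal sketch, not a description of any platform's base caller: it assumes that errors are independent between reads, that every read has the same error rate, and (as a worst case) that all erroneous reads show the same wrong base. The function name and the 10% per-read error rate are hypothetical values chosen for illustration.

```python
from math import comb


def consensus_error(per_read_error: float, coverage: int) -> float:
    """Probability that a simple majority vote over `coverage` reads is wrong.

    Illustrative assumptions: independent errors, identical per-read error
    rate, and all erroneous reads agreeing on the same wrong base.
    Ties are counted as errors.
    """
    p = per_read_error
    # The consensus fails when at least half of the reads are wrong.
    min_wrong = (coverage + 1) // 2
    return sum(
        comb(coverage, k) * p**k * (1 - p) ** (coverage - k)
        for k in range(min_wrong, coverage + 1)
    )


# Hypothetical 10% per-read error rate at increasing depth of coverage.
for depth in (1, 5, 15, 30):
    print(depth, consensus_error(0.10, depth))
```

Under these assumptions the consensus error falls rapidly as depth increases, which is the behavior the paragraph above describes. Systematic (non-random) errors violate the independence assumption and would not be removed by additional coverage.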

Several HTS platforms are available and currently used for microbial sequencing, usually based on massively parallel sequencing-by-synthesis strategies with high accuracy and a reduced footprint compared with Sanger sequencing. The primary HTS platforms include the HiSeq and MiSeq from Illumina (San Diego, CA, USA), the Ion PGM and Ion Proton Sequencers from ThermoFisher (South San Francisco, CA, USA) and the 454 systems from Roche (Pleasanton, CA, USA). The Illumina NextSeq 500 system is the latest platform on the market in this desktop category with 150 Gigabase throughput and ‘push-button simplicity’.

Another type of sequencing chemistry, developed by Pacific Biosciences (PacBio, Menlo Park, CA, USA), is the first to utilize single molecule real time (SMRT) sequencing, in which each base is detected in real time as a polymerase adds fluorescently tagged nucleotides along single DNA template molecules. SMRT sequencing is distinct from the other HTS technologies in providing very long read lengths. The average read length with the PacBio RS instrument is approximately 3,000 bp and can reach up to 20,000 bp [54]. Furthermore, examining the polymerase kinetics of SMRT sequencing allows for direct detection of methylated DNA bases [55]. This intrinsic capability of the SMRT sequencing workflow does not affect primary sequence determination, while yielding yet another forensic signature that is not captured with standard protocols on other HTS instruments.

A new and potentially revolutionary sequencing platform in development by Oxford Nanopore (Oxford, United Kingdom) will allow a single DNA molecule to pass through a protein nanopore set within an electrically resistant membrane bilayer. The resulting cross-membrane current fluctuations are used to identify the targeted nucleotide [56]. The company projects that sequencing rates initially will be 20 bases per second, increasing to 1,000 bases per second in the future, with read lengths of up to tens of thousands of bases [57]. While these individual reads will contain a larger number of errors than reads from the other HTS instruments mentioned, the PacBio (and potentially the Oxford Nanopore) errors are random. With redundant interrogation of the same base of a circular template with SMRT sequencing and with sufficient depth of coverage, highly accurate consensus calls can be obtained [54].

HTS vastly improves the possibility that the forensic and scientific communities will be able to assign features (for example, strain identity, virulence profile, and so on) and, ultimately, attribution to bio-forensic evidence. However, these improvements cannot be realized or known with any level of statistical confidence without effective and validated bioinformatics tools to process, analyze and interpret the large amounts of HTS data generated. Most application-oriented laboratories are unlikely to have in-house bioinformaticians, and even for laboratories with such resources, a comprehensive data analysis pipeline must be defined and validated to establish that the software algorithm(s) reliably analyze sequence data and produce accurate final results. Many bioinformatic tools are available within commercial, academic and other open sources. However, the specific tools employed or developed are highly dependent on the need and intended use of that laboratory and may not have been rigorously tested. An appropriate data analysis pipeline must be implemented and fully validated, including understanding the uncertainty and error associated with each step of the process, as well as the collective uncertainty. The appropriate interpretation and weight of the evidence must be employed successfully and effectively communicated.

Now that laboratories are implementing HTS systems, community-accepted validation guidelines or standards are needed. Development of HTS technologies and associated bioinformatics tools will continue to progress rapidly, and, no doubt, increasingly powerful tools will be available to support microbial forensics. HTS applications for microbial forensics include assembly of draft and finished single genomes of microorganisms, targeted site sequencing, metagenomics (both amplicon sequencing of conserved genes for microbial community structure and shotgun sequencing for profiling the content of a sample), and source attribution, including profiling, sample comparison, sample engineering, and other microbial evolution or epidemiology applications. As with any analytical tool(s) for forensic application, the utility of HTS operating conditions and tolerances must be carefully defined. Regardless of the variation in technologies and software, guiding principles, such as the criteria listed in Table 1, must be established to validate HTS systems. Here we define the criteria and offer a process for validation of HTS systems in microbial forensics. Rather than delineating a set of protocols for a particular set of tools and reagents that apply to a limited set of instances, which may quickly become obsolete, those tools and reagents universally needed for protocol validation are described. By addressing each area described below, an investigator will be able to establish, validate and implement HTS as a tool for microbial forensics.

Table 1

Validation criteria for analytical performance metrics

Criteria	Definitions
Analytical sensitivity	Likelihood that the assay will detect a target (for example, organism variant, sequence region, functional element, and so on) in a sample (that is, target), if present; can include target attribution when defined as strain- or isolate-level detection. Also known as the true positive rate. Calculated by dividing number of true positives by the sum of true positive and false negatives (TP/(TP + FN)).
Analytical specificity	Likelihood that the assay will not detect a target, if not in the sample; can include false target attribution. Also known as the true negative rate. Calculated by dividing true negatives by the sum of true negatives plus false positives (TN/(TN + FP)). May be impractical to calculate for methods designed to detect the known universe of organisms.
Precision	The degree that individual measurements of the same sample are similar with regard to the presence and absence of target. Determined by the distribution of random errors and not the true or underlying value.
Accuracy	Degree that the material measured is similar to its true value. Calculated by (TP + TN)/(TP + FP + FN + TN).
Reproducibility	The degree to which the same result(s) is obtained for a sample when the assay is repeated between/among different operators and/or detection instruments.
Repeatability	The degree to which the same result(s) is obtained for a sample when the assay is repeated by the same operator and/or detection instrument.
Limit of detection	Minimum level of input material for a target as a proportion of the total at which all replicates are consistently positive for that target.
Reportable range	The region(s) of genome(s) that are sequenced and from which information is drawn for comparison or attribution.
False positive rate	The rate at which a target is incorrectly called as present. Also known as Type I error. Calculated as 1 – specificity.
False negative rate	The rate at which a target organism is incorrectly called as absent. Also known as Type II error. Calculated as 1 – sensitivity.
Assay robustness	Stability of analytical performance under variable conditions, that is, likelihood of assay success.
Reference materials^a	Materials/samples used to test the performance of the assay (for example, reference panels of the target and mock or non-probative materials) relevant to the intended application of the assay.
Databases^a	Collection of data and reference genomes, genes and genomic elements to be used for interpretation of results.
Interpretation criteria for results^a	Analysis (quantitative or qualitative) used and confidence level of a result (match, association, most recent common ancestor, and so on).

^aThese last three items – Reference materials, Databases, and Interpretation criteria – typically have not been considered validation criteria. However, they have been included here primarily because interpretation of results is an essential part of generating reliable and appropriate results, which should be described within a standard operating protocol (SOP). The data used to test a system are reliant on reference materials and, depending on the situation, databases. See [58-62].
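The count-based criteria in Table 1 can be computed directly from a confusion matrix. The sketch below implements the formulas exactly as stated in the table; the class name and the example counts are hypothetical and do not come from any validation study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts of true/false positives and negatives from a validation panel."""
    tp: int
    fp: int
    tn: int
    fn: int

    def sensitivity(self) -> float:
        # True positive rate: TP / (TP + FN)
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # True negative rate: TN / (TN + FP)
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        # (TP + TN) / (TP + FP + FN + TN)
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def false_positive_rate(self) -> float:
        # Type I error: 1 - specificity
        return 1.0 - self.specificity()

    def false_negative_rate(self) -> float:
        # Type II error: 1 - sensitivity
        return 1.0 - self.sensitivity()


# Hypothetical panel: 100 samples containing the target, 200 without it.
panel = ConfusionCounts(tp=90, fp=4, tn=196, fn=10)
print(panel.sensitivity(), panel.specificity(), panel.accuracy())
print(panel.false_positive_rate(), panel.false_negative_rate())
```

As Table 1 notes, specificity (and therefore the false positive rate) may be impractical to estimate for methods designed to detect the known universe of organisms, because the set of true negatives is not well defined in that setting.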

Application and validation of HTS for microbial forensics

Microbial forensic applications of HTS include single isolate sequencing with de novo assembly, read mapping, targeted sequencing of specified genes or other regions of interest (which generally include diagnostic markers, for example, SNPs, indels, and so on) [63,64], and metagenomics. Metagenomics uses sequencing of DNA (or RNA) from samples to identify or describe the microbial community composition of environmental samples such as soil [65], plants [41,42], sea water [66,67], hospital environments [68] and human-associated habitats [69,70]. HTS makes metagenomics readily feasible since culturing is not required for sample enrichment. HTS and associated bioinformatic technologies make it possible to detect microorganisms of interest when they are present in low abundance and differentiate them from near neighbors by using diagnostic genomic signatures.

Customers, stakeholders, the judicial system and the public expect forensic methods to be validated, when feasible, prior to use on forensic evidence. A validation process is essential in the development of methods for microbial forensics, and such methods must be reliable, defensible and fit for purpose.

Validation has been described as the process that:

1. Assesses the ability of procedures to obtain reliable results under defined conditions.

2. Rigorously defines the conditions that are required to obtain the results.

3. Determines the limitations of the procedures.

4. Identifies aspects of the analysis that must be monitored and controlled.

5. Forms the basis for the development of interpretation guidelines to convey the significance of the findings [58].

While these general principles apply to HTS technologies and guidelines specifically for HTS used in metagenomic profiling already exist [71], there are challenges that arise when validating HTS for microbial forensics that require further consideration. Here we describe the specific guidelines for validating HTS technologies so that the microbial forensics community (and others) will have a common protocol and lexicon to leverage the exciting potential of HTS while maintaining high quality and confidence under rigorous scrutiny when this technology is used to support investigations of bioterrorism or biocrimes.

General considerations for validation

The requirements for validation will vary according to the process in question and should be defined in terms of the specific application. While full developmental and internal validation is ideal [7,58], this requirement may not be practical for all situations, such as an attack involving a novel agent not in previously validated systems. Indeed, the use of multilocus variable number of tandem repeat (VNTR) analysis [72] to determine that the strain of B. anthracis in the 2001 letter attack was Ames was not a fully validated procedure in casework analysis. Yet, it was sufficiently developed for investigative lead value [73].

Because of the vast and incompletely described biological diversity of microbes and the potential of having to deal with a large number of samples in a microbial forensic case, it is not possible to validate every scenario. Moreover, HTS and bioinformatics technologies are changing rapidly and will continue to be improved in the immediate and long-range future. Lastly, exigent circumstances may require immediate response, and microbial forensics should be able to lend support using all available tools. For such unforeseen circumstances preliminary validation may be ‘carried out to acquire limited test data to enable the evaluation of a method for its investigative-lead value, with the intent of identifying key parameters and operating conditions and of establishing a degree of confidence in the methods of collection, extraction, and analysis’ [74]. However, once general validation is accomplished for instrumentation, bioinformatics data analysis, and Standard Operating Protocols (SOPs), only novel aspects of validation for new targets may be needed to generate informative leads and to make public health decisions with associated levels of confidence. Therefore, it is extremely important to establish comprehensive criteria for validation of HTS technologies with all aspects of the validation study documented. The fact that a validation study is preliminary should be stated clearly, with the limitations of the assay and validation study clearly described. However, validation of finalized SOPs is essential for reliable and defensible use of HTS technologies in microbial forensics. Sample collection and storage have been addressed elsewhere [75] and will not be described here. Validation of the HTS process addressed here relies, in part, on reports available in the literature [59-61,76] that have defined validation requirements for HTS applied to human clinical genetic analyses. 
The validation guidelines for the three major technical components of HTS (sample preparation, sequencing and data interpretation) as related to the field of microbial forensics, are presented in the following sections.

Sample preparation

Nucleic acid extraction – quantity and purity

Validation should include anticipated sample types and the matrices of those sample types. A range of routinely anticipated sample types should be included, incorporating an array of nucleic acid quality and quantity, environmental matrices, inhibitors of downstream analytical processes and biological contaminants expected to affect reliability, specificity and the ability to obtain results.

Template DNA (or RNA, even though DNA is referenced here) must be of sufficient quantity and quality for library preparation and sequencing. The amount of DNA available will influence the library preparation method used. At the time of preparation of this manuscript, for example, the TruSeq (Illumina, Inc.) sequencing preparation method requires approximately 100 ng to 1 μg [77], Haloplex (Agilent, Santa Clara, CA, USA) 225 ng [78], Nextera XT (Illumina) 1 ng [79], and polymerase chain reaction (PCR)-based methods, though variable, may require less than 1 ng. Minimum and maximum DNA requirements for analysis should be established using a laboratory’s work flow. A set of guidelines is needed to establish what levels of prepared DNA may be insufficient or compromised and how to proceed under such circumstances (for example, analyze anyway, stop, or select an alternate assay). Metrics based on precise quantitative pre-analytical sample characterization are needed to assess the fraction of template molecules that meet the requirements for downstream analyses, which is important for amplicon sequencing and shotgun sequencing. It is likely that samples from which the DNA is insufficient, damaged and/or inaccessible will be encountered, especially when collected from the environment. This information will be helpful to assess and compare potential downstream partial and/or complete loss of target data. The DNA extraction method used should be tested for yield and sufficient purity for downstream analytical processes. Additional extraction processes may include separating a particular genome from a metagenomic sample or selective filtration to separate specific types of microbes, such as virus particles in a metagenomic sample [71,80] or methylated DNA from non-methylated DNA [81]. Since host DNA or background genome(s) may comprise a major component(s) of a given metagenomic sample, the ability to sequence minor components of complex samples may be affected. 
Purification procedures used to maximize the yield of targets of interest should be evaluated in the same manner as the nucleic acid purification process. Lastly, proper positive and negative controls should be included to assess process performance and laboratory background contamination, respectively.
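A laboratory's guidelines for insufficient input DNA can be expressed as an explicit decision rule. The sketch below is illustrative only: the minimum input values are the approximate figures quoted above for the time the manuscript was prepared, and the function names and the three outcomes (proceed, select an alternate assay, stop and review) are hypothetical stand-ins for limits and actions that each laboratory would establish in its own validation.

```python
# Approximate minimum input DNA (ng) quoted in the text above; a laboratory
# would substitute limits established in its own validation studies.
MIN_INPUT_NG = {"TruSeq": 100.0, "Haloplex": 225.0, "Nextera XT": 1.0}


def eligible_methods(dna_ng: float) -> list[str]:
    """Return the library preparation methods whose minimum input is met."""
    return sorted(name for name, min_ng in MIN_INPUT_NG.items() if dna_ng >= min_ng)


def triage(dna_ng: float, preferred: str) -> tuple[str, list[str]]:
    """Hypothetical decision rule: proceed, switch assay, or stop for review."""
    methods = eligible_methods(dna_ng)
    if preferred in methods:
        return "proceed", [preferred]
    if methods:
        return "select alternate assay", methods
    return "stop and review", []


# Three hypothetical quantification results, in nanograms.
for amount in (500.0, 150.0, 0.5):
    print(amount, triage(amount, "Haloplex"))
```

Quantity is only one of the criteria discussed above; a fuller rule would also account for purity, fragment integrity and the presence of inhibitors.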

Enrichment and library preparation

DNA samples, whether single source or metagenomic, may be enriched for specific target regions of genomes using a capture approach or PCR. For many enrichment processes the desired genomic regions should be known and defined in order to design the protocol. However, whole genome amplification methods, such as those based on non-specific or degenerate PCR primers [82,83], and including multiple displacement amplification [84], can be used. The methods used for genome amplification can affect the results by introducing contaminating chimeras and sequence bias [71]; depending on the method or assay, these effects should be considered during validation.

Capture- and PCR-based methods have both advantages and limitations. PCR-based methods provide greater sensitivity of detection, but are likely to produce greater error from mis-incorporation by the polymerase than would be generated with a capture approach. PCR-based methods, in which a multiplex panel of markers may be considered, will require development of primer sets that amplify the targeted sites in a balanced fashion (or at least describe any significant imbalance) and do not cross-hybridize to unspecified targets. In contrast, capture methods will require more template DNA and would not provide the limit of detection necessary for microbial forensic analyses of trace materials. Regardless of the methods listed here or new ones subsequently introduced, it is incumbent upon the analyst to define validation criteria that address the advantages and limitations of enrichment.

Whether or not a sample is enriched, the next step in sample preparation is library preparation where the DNA sample is modified for sequencing. DNA is typically fragmented into shorter pieces by mechanical shearing (for example, sonication) or enzymatic fragmentation (for example, tagmentation [79,85]). Adapters are added to each fragment of DNA to facilitate clonal amplification prior to sequencing of the cloned fragments. Adapters can be incorporated into existing amplicon fragments during PCR. With long PCR amplicons, fragmentation may be required. DNA fragments and/or PCR amplicons then are size-selected for the range appropriate for down-stream sequencing and quality assessment. This process generates a library of millions of cloned fragments that are ready for sequencing. Quality must be assured by testing reproducibility of library preparations and robustness of indexing (described below) to identify (or misidentify) labeled fragments. Internal controls to monitor enrichment and library quality should be considered.

Multiplexing

Multiplexing with HTS can be achieved by barcoding (or indexing) [86,87]. Short unique sequence tags are added to every fragment of a sample during library preparation to ‘tag’ the fragments unique to a sample. Thereby, samples can be pooled (or multiplexed) and data separated (that is, demultiplexed) after sequencing, based on the unique tagged sequences. With the high throughput capacity afforded by HTS, many different samples may be sequenced simultaneously. For example, the MiSeq with Reagent Kit V2 (Illumina) is capable of generating 7.5 to 8.5 Gbp using a 2 × 250 paired-end run (about 39 hours of sequencing run time). With 8 Gbp of sequence data, 16 samples can be multiplexed on a single run assuming a desired 100× coverage of a 5 Mb bacterial genome (5 × 10^6 bp genome × 100× coverage × 16 samples = 8 × 10^9 bp of MiSeq output). This calculation is just an example and will change as throughput and read lengths increase, which is likely to occur relatively quickly and often. As the throughput of HTS continues to increase, more samples could be multiplexed in a single run without sacrificing depth of coverage, or more complex samples may be analyzed at a greater depth of coverage. In theory, hundreds to thousands of barcodes could be synthesized, but currently 12 to 384 different reference samples can be pooled in a single reaction [86,87]. The Earth Microbiome Project provides >2,000 barcodes that could be combined, theoretically enabling multiplexing of >4 million samples in a single sequencing run [88]. Depending on the target, for example, single source samples, the number of samples that can be barcoded and sequenced in a single run should be predictable. The performance of barcoding to identify specifically tagged samples should be evaluated and documented. 
Furthermore, when feasible, use of different indexes in sequential sequencing runs on an instrument can indicate if carry-over contamination has occurred, which offers another quality control for monitoring potential impact of contamination on sequencing results.
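The multiplexing arithmetic above can be expressed as a short calculation. The following is a minimal sketch, assuming equal pooling of samples and that every sequenced base is usable; in practice, filtering, index misassignment, and uneven pooling will lower the real number. The function name is illustrative only.

```python
def max_samples_per_run(run_output_bp: int, genome_size_bp: int, target_coverage: int) -> int:
    """Number of equally pooled samples that can each reach the target mean coverage."""
    if min(run_output_bp, genome_size_bp, target_coverage) <= 0:
        raise ValueError("all inputs must be positive")
    # Bases needed for one sample at the requested mean depth of coverage.
    bases_per_sample = genome_size_bp * target_coverage
    return run_output_bp // bases_per_sample


# Worked example from the text: 8 Gbp run, 5 Mb genome, 100x coverage.
print(max_samples_per_run(8_000_000_000, 5_000_000, 100))  # 16
```

Re-running the calculation with the lower end of the stated output range (7.5 Gbp) gives 15 samples, which illustrates why the expected run yield should be documented alongside the pooling plan.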

Sequencing

Each HTS system employs a unique chemistry for sequence determination and each will have to be validated in general and then specifically according to applicable features of the system [51,52,89-93]. For example, chemistries employed by 454 and Ion Torrent systems tend to be less accurate than Illumina-based chemistry for sequencing homopolymers. The type of sequencing, either single-end (fragments sequenced at one end only) or paired-end (both ends are sequenced), can impact coverage and stringency in different ways. Sanger sequencing, still considered the gold standard, allows for some concordance testing (that is, comparative analysis of the same target sequence with different analytical systems). However, there is no guarantee that the gold standard always provides the correct result. For example, Harismendy et al. [94] sequenced 266 kb of portions of six ion channel-related genes using Sanger sequencing, three HTS platforms, and one microarray platform and compared the results. The proportions of false negative (FN) and false positive (FP) single nucleotide polymorphisms (SNPs) attributed to Sanger sequencing were 0.9% and 3.1%, respectively. Moreover, the lower throughput and coverage of Sanger sequencing make it impractical for concordance testing with HTS-generated data. The data generated by HTS are so much greater per run than those generated by Sanger sequencing that only limited sampling and very short regions can be reasonably compared. Instead, concordance testing may be better achieved by testing orthogonal HTS systems with templates of ‘known’ genome sequence. Potential errors and biases inherent in each HTS system may be determined and documented better in this manner. For each sample type and platform, the error rate (and error profile) of sequencing can be determined only by empirical testing. The data can be used to define limitations of the current system that should be part of an interpretation SOP.
In addition, orthogonal testing allows for identifying weaknesses and enables assay improvements before implementation. Where possible, orthogonal analyses should be employed for validating HTS methods.

Data analysis and interpretation

The final major components of HTS validation are data analysis and interpretation of results. Bioinformatics is essential and critical because of the massive amount of data, the requirement to answer forensic and investigative questions using the data, and the questions that may be addressed with trace samples, complex samples, potential genetic engineering, and background endemicity (that is, microorganisms that generally are found at a location). Comprehensive data analysis pipeline(s) should be defined and validated. It is important to establish that the software algorithms reliably analyze sequence data to produce accurate final results. The flow of data generally progresses through base calling, quality control and, finally, downstream taxonomic, functional and/or comparative analysis (which is generally either alignment- or fragment-based, if assembly is not performed) (Figure 1).

Figure 1

Basic schematic of data flow through an analysis process. The first step of base calling generally is completed by the instrument software, and each downstream step must be included in the validated analytical pipeline. Additional data processing after generating sequence reads is required, for example with contig building and/or alignment, and will depend on the application.

Quality metrics, generated during the analytical process, include: quality scores for base calling, read-level quality control (QC) (to include trimming of low quality bases on fragment ends), alignment, GC content, depth of coverage, strand bias and variant calling. Base calling, the identification of the specific nucleotide present at each position in a single read, should be part of instrument software. A quality threshold of base scoring is typically set with a Q score. A threshold of Q20 sets the minimum base call accuracy at 99%, corresponding to a probability of an incorrect base call of 1 in 100, while a Q30 score sets the accuracy at 99.9% and the probability of an incorrect base call at 1 in 1,000 [95]. A Q score threshold should be set for both validation studies and subsequent implementation. However, there are no guidelines that suggest that, for example, a Q20 score is a requirement. A score below Q20 may not impact accuracy as subsequent coverage and annotation may be adequate. Under defined conditions and for investigative leads or exigent circumstances the quality score may be relaxed; however, the justification or reliability of a lower score must be documented. Each HTS platform will have specific sequencing limitations and errors: signal-intensity decay over the read, erroneous insertions and deletions, strand bias, and so on. These limitations should be described and defined.

Accuracy of identifying sequence variants (for example, SNPs, indels, chromosomal rearrangements, copy number variants) is dependent on a number of factors that include base calling and alignment as well as choice of reference genome, depth of sequence coverage (as well as average coverage), and sequence chemistry/platform. Because alignment involves arranging a read with a reference sequence (targeted or whole genome), different alignment strategies can and do produce different results (Figure 2).
Differences in alignment will vary with software, so rules for alignment should be defined for consistency and traceability.
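The Q score thresholds discussed above follow the Phred scale, in which Q = −10 log10(P) and P is the estimated probability that a base call is incorrect. The sketch below converts between the two and sums the per-base probabilities to give the expected number of errors in a read; the function names are illustrative, and the Phred+33 ASCII offset is an assumption that matches current FASTQ conventions but should be confirmed for the platform in use.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to integer Phred scores (offset is assumed)."""
    return [ord(symbol) - offset for symbol in quality]


def expected_errors(scores) -> float:
    """Expected number of miscalled bases in a read, given per-base scores."""
    return sum(phred_to_error_probability(q) for q in scores)
```

For instance, a 100-base read in which every base is exactly Q20 is expected to contain about one miscalled base, whereas the same read at Q30 is expected to contain about 0.1.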

Figure 2

Alternate alignments of identical sequences. Reads 1 and 2 are aligned in equally optimal ways that indicate different locations for a 2 bp deletion relative to the reference. Differences in alignment can be problematic when an evidence sample’s consensus alignment is based on a different approach than that of the reference sample or entries in a database.
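The ambiguity shown in Figure 2 can be reproduced with a toy sequence. The sketch below uses a made-up reference containing a short tandem repeat, shows that a 2 bp deletion placed at different positions yields the same read, and applies one common convention (shifting the deletion as far left as possible) so that equivalent alignments are reported at a single position. The reference string and function names are assumptions for illustration; this is not the rule used by any particular aligner.

```python
def delete(reference: str, start: int, length: int) -> str:
    """Return the reference with `length` bases removed at 0-based position `start`."""
    return reference[:start] + reference[start + length:]


def left_align_deletion(reference: str, start: int, length: int) -> int:
    """Shift a deletion left while the resulting sequence stays unchanged."""
    # Moving the deleted window one base left keeps the sequence identical only if
    # the base entering the window equals the base leaving it.
    while start > 0 and reference[start - 1] == reference[start + length - 1]:
        start -= 1
    return start


reference = "GGACACACTT"  # hypothetical reference with an (AC)3 repeat

# Three different placements of the same 2 bp deletion give an identical read.
print(delete(reference, 2, 2), delete(reference, 4, 2), delete(reference, 6, 2))

# After normalization, all placements are reported at the same position.
print(left_align_deletion(reference, 6, 2), left_align_deletion(reference, 4, 2))
```

Whatever convention is chosen, applying it identically to evidence samples, reference samples and database entries is what makes the resulting variant positions comparable.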

Choice of a reference genome, if used for alignment, is important. Because the reference will vary by species and circumstance, specific criteria for selection are not provided here. However, it is incumbent upon the analyst to develop criteria for the reference genome that is selected. For example, from a microbiological perspective, a reference may be one that is accessible, is relevant as the Type strain, is relevant from a public health perspective, or is well-defined microbiologically; and from a computational perspective, it may be one, several or a collection of genomes, or the optimal computational sequence, and should be curated, such as a finished genome. Validation should define the purpose of the reference genome and describe the criteria for selection.

Minimum criteria should be established by the laboratory for the output of the selected analytical method, such as depth and uniformity of coverage. Defining output thresholds for metagenomic samples may be difficult given the immense quantity of data and microbial diversity; therefore, single source samples and defined mixtures can be used as a guide. These limitations may be necessary in defining FNs and FPs. Clearly, there will be ambiguous calls due to sequencing noise and novel genome composition. The specific parameters and settings used to establish thresholds, FP and FN rates should be detailed thoroughly to enable sound interpretation and accurate comparison to alternative methods and protocols.

Software may be adequate or somewhat limited with respect to the accuracy of variant calling. The limitations should be described and quantified, and algorithms may need to be modified to address specific limitations. The method(s) of identification and annotation should be described. Different formats are available for exporting variants and annotations [59]. The file format should include ‘a definition of the file structure and the organization of the data, specification of the coordinate system being used, e.g., the reference genome to which the coordinates correspond, whether numbering is 0-based or 1-based, and the method of numbering coordinates for different classes of variants, and the ability to interconvert to other variant formats and software’ [59].
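The distinction between 0-based and 1-based numbering is a frequent source of off-by-one errors when variants are exchanged between formats. As a minimal sketch, the functions below convert an interval between a 1-based, closed convention and a 0-based, half-open convention; the function names and the toy reference string are assumptions for illustration.

```python
def one_based_closed_to_zero_based_half_open(start: int, end: int) -> tuple:
    """Convert a 1-based inclusive interval to a 0-based, end-exclusive interval."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start: int, end: int) -> tuple:
    """Convert a 0-based, end-exclusive interval to a 1-based inclusive interval."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


reference = "ACGTACGT"  # hypothetical reference sequence

# Bases 3 through 5 in 1-based numbering are the slice [2:5] in 0-based numbering.
zero_start, zero_end = one_based_closed_to_zero_based_half_open(3, 5)
print(reference[zero_start:zero_end])  # GTA
```

Recording which convention each file uses, together with the reference genome to which the coordinates refer, allows such conversions to be checked rather than assumed.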

The FP and FN rate often are determined for most analytical assays. However, there are additional considerations with HTS and microbial identification. Similar to homologous regions and pseudogenes for human genetic variation testing [60], the homologous regions of a near neighbor (for example, genes shared across the bacterial kingdom) become important for target identification (target being the species or strain of interest). Generating a metagenomic profile at the resolution of the phylum level, or even the genus level, may indicate a general environment from which a sample originates but often cannot identify the microorganism of interest at the species or strain level. However, newer approaches have started to achieve strain level identification [96-99] by exploiting higher throughput and novel algorithms. The results can be accurate and reliable and can translate into identification of the target agent in an unknown complex sample. Many reliable reads of the sequence of any particular species will share the same sequence, particularly so with near neighbors. For example, while Bacillus species may be sufficiently diverse to discriminate in a particular assay, strains of B. anthracis are nearly indistinguishable from one another [100]. FPs must be defined by specificity and the ability to phylogenetically differentiate a species (or strain) from near neighbors, such as Bacillus anthracis and Bacillus cereus. Testing that a known single source sample fits in a phylogenetic schema is not the same as identifying a particular species in a simple or complex sample. Methods for identification of targets should be validated based on intended use. FN rate may be difficult to determine for metagenomic sample analyses as stochastic effects and sampling variance may impact detection of the target(s). Known data sets can be helpful to define the FN rate.

Once assay conditions and pipeline configurations have been established, the entire method should be tested prior to use. Although individual components may have been validated, it is imperative to demonstrate that valid and reliable results are obtained when the components are combined. The standard microbial forensics validation criteria [7,58] apply to HTS methods as well. Special attention should be given to accuracy, precision, analytical sensitivity and specificity, reproducibility, limits of detection, robustness, reportable range, reference range, either FN/FP or confidence, statements of findings and databases used (Table 1). The laboratory must select and be able to clearly and defensibly state the parameters and thresholds necessary to determine whether the overall sequencing run is of sufficient quality to be considered successful. Criteria should include error rate, percentage of target captured, percentage of reads aligned, average and range of coverage depth, and so on.

Reference materials

Reference materials (RMs) should be used during test validation. Well-characterized reference samples should be included to establish baseline data to which future test modifications also can be compared [60]. Many different types of samples can serve as RMs for HTS, including characterized DNA derived from specimens prepared from microbial cultures, samples collected from several different endemic regions with high incidence of microorganisms of interest, samples from several non-endemic regions discovered accidentally and described as isolated outbreaks or findings, synthetic DNA (sets of sequences of known isolates), or electronic data (that is, generated in silico). A gold-standard reference genome would contain a single gap-less sequence for each chromosome or plasmid in the source organism, with no ambiguous bases. RMs are homogeneous for single source samples, stable and defined. Because complex mixtures are likely to be unstable and subject to stochastic effects, simple mixtures should be used. In silico complex samples, which can be considered stable, are suitable for testing the bioinformatics pipeline. The sequences used and parameters employed for testing should be documented. The same rationale can be applied to positive controls, which must be defined. Negative controls may include no-template controls, blank controls for different phases of the analytical process or DNA samples void of the target.

Bioinformatics software management

The bioinformatics community has not yet defined uniform guidelines or protocols for benchmarking software. Thus, users must fully validate and document their bioinformatics pipeline. Software may be open source, purchased from commercial entities, developed in-house, or come from a combination of sources. The software programs should perform general quality metrics assessment, but the software likely will differ in performance and potentially yield different results. Therefore, accurate versioning of the state of the software is essential [76], not just for validation but also for data analyses. The software and modifications must be tracked. Settings that can be modified by the user should be documented. Documentation also should include the specific version(s) of each component of the pipeline, the hardware, dates of use and changes to software. Each software upgrade requires revalidation of the steps downstream of HTS. Virtual Machines [101], which are software simulation(s) of a machine, encompass the entire computational environment used for analysis and can help accomplish comprehensive version control on this complete system. By maintaining informative curated reference datasets, validation of updates or changes to software pipelines may be facilitated without any additional HTS or with only minimal effort.
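One way to support the documentation described above is to write a small provenance record alongside every analysis. The sketch below is an assumption about how such a record might look, not a prescribed format: the tool names and version strings are hypothetical, and a laboratory would substitute the components of its own validated pipeline. Hashing a canonical form of the user-modifiable settings makes an unintended change easy to detect.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def settings_digest(settings: dict) -> str:
    """SHA-256 of the settings in a canonical (key-sorted) JSON form."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(tool_versions: dict, settings: dict) -> dict:
    """Collect software versions, settings, environment and date of use."""
    return {
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "tool_versions": dict(sorted(tool_versions.items())),
        "settings": settings,
        "settings_sha256": settings_digest(settings),
    }


# Hypothetical pipeline components and settings, for illustration only.
manifest = build_manifest(
    {"example_aligner": "1.2.3", "example_classifier": "0.9.0"},
    {"min_base_quality": 20, "min_read_length": 50},
)
print(json.dumps(manifest, indent=2))
```

A record of this kind complements, but does not replace, full environment capture such as a virtual machine image.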

Analysis by computer software is an essential component of using HTS data. Two general criteria addressing software performance are verification and validation. According to the Institute of Electrical and Electronics Engineers (IEEE) Std 610.12-1990 [102], verification is ‘the process of evaluating a system or component to determine whether the products of a given development phase satisfy the conditions imposed at the start of that phase,’ and validation is ‘the process of evaluating a system or component during or at the end of the development process to determine whether it satisfies specified requirements.’ Since verification applies to whether the software system was built correctly and validation is whether the intended use was met, most users will only have tools at their disposal to perform a validation of software. To be considered validated, software must be tested using input data that replicate challenging real-world situations. Software can be integrated within the HTS process (for example, instrumentation) for data generation and external to the HTS system for interpretation (for example, phylogenetics, attribution, and so on). The software specifications should conform to the intended use [103]. Performance characteristics must define the entire process addressed by the software and individual process steps, as appropriate. Much of the above discussion on data generation addressed the criteria that relate to performance of data generation and subsequent interpretation and, thus, serves as a guide for software performance characteristics. Users who create software for intended applications should refer to the standards by the IEEE [102]. However, the majority of users in the application-oriented laboratory will obtain commercially-available software, and so validation likely will be at the ‘black box’ level. Even without access to the source code, users still are responsible for ensuring that the software performs the intended functions correctly. 
Regardless, since software requirements often are derived from overall system requirements for the HTS analytical tool, a systems approach is recommended for validation. The user’s intended use and needs [103] drive the criteria for validation. When possible, the user can rely on the manufacturer’s data for some of the validation, if the data are made available, or on the scientific literature. Nevertheless, an internal validation demonstrating that performance criteria are met is required. Software is different from instrumentation in that it does not wear out [103], and likely will be modified for better performance over the lifespan of its use (however, the computer operating system can change, rendering the software incompatible with the newer system). Since software will be modified and updated, a validation analysis should be conducted for the specific change(s) following the same principles of validation. New problems may arise with the intended update and, therefore, any impact that modification may have on software beyond the change should be tested using a systems approach [103].

Data storage

Permanent storage of all raw HTS data is not practical as the raw data are exceedingly large. After base-calling, this information is routinely discarded. Sequence data should be in conventional, widely used format(s), for example, .fastq files for sequence reads, or be readily convertible to a standard format [59]. Many bioinformatics pipelines create output data structures that may be compressed and stored as an alternative to the .fastq read files. Other compression systems have been proposed for HTS data [104,105], but there may be utility in widely used methods like gzip compression that will likely remain in use for years to come. A best practice should be to create a process so that results can be re-analyzed as necessary when updates are made to the downstream bioinformatics systems. Other files to archive include: SAM/BAM (mapping data) and vcf (variants). These or similar formats could be used to store alignments and data about known SNPs of special diagnostic power, such as canonical SNPs. Where possible, DNA samples should be stored for re-testing [76]. Because of limitations with large amounts of data, it is necessary that reference datasets are documented and maintained in order to perform validation of future software updates. Lastly, conversion of data from one format to another could create unforeseen transcription errors; therefore, defined data sets should be tested before and after data conversion for accuracy.
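Testing a defined data set before and after compression or format conversion can be automated. The following is a minimal sketch, assuming simple four-line FASTQ records and a hypothetical two-read test set: it checks that gzip compression restores the original bytes exactly and that a FASTQ-to-FASTA conversion preserves every read name and sequence. The parsers are deliberately simplistic and are not a substitute for a validated library.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def gzip_round_trip_is_lossless(text: str) -> bool:
    """Compress and decompress, then compare checksums of original and restored data."""
    original = text.encode("ascii")
    restored = gzip.decompress(gzip.compress(original))
    return sha256_hex(original) == sha256_hex(restored)


def parse_fastq(text: str) -> list:
    """Parse four-line FASTQ records into (name, sequence, quality) tuples."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], sequence, quality))
    return records


def fastq_to_fasta(text: str) -> str:
    """Convert FASTQ text to FASTA text (quality values are dropped)."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in parse_fastq(text))


def parse_fasta(text: str) -> list:
    """Parse FASTA text into (name, sequence) tuples."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line.strip():
            records[-1][1] += line.strip()
    return [(name, seq) for name, seq in records]


# Hypothetical defined test set.
test_fastq = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII5I\n"
before = [(name, seq) for name, seq, _ in parse_fastq(test_fastq)]
after = parse_fasta(fastq_to_fasta(test_fastq))
print(gzip_round_trip_is_lossless(test_fastq), before == after)
```

Note that the FASTA conversion in this sketch discards quality values, which is itself a loss of information that should be recorded when such a conversion is used.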

Interpretation and reporting

Interpretation of results for attribution should be defined clearly and documented. Equally important, the level of resolution possible with a particular system should be stated. Also, the database(s) used for validation (and for casework analysis) is likely to be expanded and improved with HTS technologies on a relatively rapid basis; so the records of the database(s) used for individual analyses must be maintained. The target areas that define a species or strain and resolve it from near neighbors are critical [100,106]. One or more sites may be required depending on phylogenetic resolution. A minimum number of targets and degree of confidence with the number of targets should be established [107]. The means by which a software pipeline determines attribution may not be accessible to the user, in which case all relevant output data and associated thresholds should be documented and stored in a standard way according to the SOP. The minimum number of reads is essential for limits of detection, stochastic effects, and FNs and FPs and should be defined empirically for obtaining a reliable result(s). An interpretation statement(s) and degree of confidence (qualitative or quantitative) should be developed regarding attribution of the sample, and that confidence, when feasible, should be based in a rigorous statistical framework.

Resequencing assembly can be effective if the reference dataset contains sequences of closely related reference genomes [71]. De novo sequencing is computationally more demanding. Thus, the software and, just as importantly, reference data sets are critical to result quality. There are a number of assembly algorithms that can take millions of short reads generated by HTS and translate them into a partial or complete genome sequence [108-112]. Each approach has benefits and limitations affecting quality and efficiency. Therefore, the specific software used, standard metrics (for example, N50, coverage, contig sizes), assumptions and criteria applied should be documented [113,114].
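Of the assembly metrics listed above, N50 is the one most often misreported, so a worked definition may help. N50 is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. The sketch below computes it from a list of contig lengths; the example lengths are made up.

```python
def n50(contig_lengths) -> int:
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length


# Hypothetical assembly: total 290 kb, so half is 145 kb; 80 + 70 = 150 kb reaches it.
print(n50([80_000, 70_000, 50_000, 40_000, 30_000, 20_000]))  # 70000
```

Because N50 depends on which contigs are included (for example, whether very short contigs are filtered out), the filtering criteria should be documented together with the value.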

While there may be some situations in which assembly is required, it is less likely to be used or even necessary for the foreseeable future in microbial forensics analyses and especially with mixtures or metagenomic samples where near neighbors, strains and coverage constraints reduce the practicality of assembly. Alignment strategies or fragment counting strategies are preferable to assembly. Sequence alignment compares DNA sequences (although it can apply to any kind of sequence data) and seeks to identify regions of homology. More often a reference sequence will be maintained, targeted short reads will be aligned with that reference, and differences with respect to the reference will be listed (as ‘variants’) [115-117]. In addition to the same documentation requirements for assembly strategies, the reference sequence should be fully documented. While we believe that alignment strategies will be favored over assembly strategies for metagenomic microbial forensic applications, if capabilities improve that enable effective assembly, then it is incumbent upon the user to validate the process.

Taxonomic assignment

Methods for read-based taxonomic classification of metagenomics data fall into two broad categories: composition-based and alignment-based. Composition-based approaches rely on comparing signatures of short motifs from a query fragment to a reference genome, for instance, a particular GC content, gene and protein family content, or k-mer frequency and distribution [71]. Composition-based approaches include Phylopythia [118], PhylopythiaS [119], Phymm [120], the Naive Bayes Classifier [121], Sequedex [122], the Livermore Metagenomic Analysis Toolkit (LMAT) [97], GENIUS [96] and Kraken [99]. Alignment-based approaches compare reads to a set of labeled reference genomes using a basic local alignment search tool (BLAST)-based approach. Alignment-based approaches include MEGAN, Bowtie, MetaPhlAn, MetaPhyler, CARMA, WebCARMA, IMG/M, MG-RAST, and others [98,116,123-132]. Additionally, methods for direct taxonomic classification of sequencing reads use a combination of both composition and sequence similarity approaches, such as MetaCluster [133], Rapid Identification of Taxonomic Assignments [134], and PhymmBL [127,128,135]. A more comprehensive review of sequence classification methodology and software is presented elsewhere [136].
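To make the idea of a composition signature concrete, the sketch below computes GC content and k-mer frequencies for a sequence and compares two sequences by the sum of absolute differences between their k-mer frequency profiles. This is a toy illustration under simplifying assumptions (no reverse complements, no ambiguous bases) and does not reproduce the method of any program named above.

```python
from collections import Counter


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_profile_distance(first: str, second: str, k: int) -> float:
    """Sum of absolute differences between normalized k-mer frequencies (0 to 2)."""
    counts_a, counts_b = kmer_counts(first, k), kmer_counts(second, k)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    kmers = set(counts_a) | set(counts_b)
    return sum(abs(counts_a[m] / total_a - counts_b[m] / total_b) for m in kmers)


print(gc_content("ACGT"))         # 0.5
print(kmer_counts("ACGTAC", 2))   # AC occurs twice; CG, GT and TA once each
```

A distance of 0 indicates identical k-mer profiles and a distance of 2 indicates profiles with no k-mers in common; real classifiers use far larger values of k and curated reference databases.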

Many programs use a phylogenetic approach to classify sequences and summarize results by taxonomic group. A sequence(s) can be assigned at any level from the phylum down to the species and strain. The output of the program may potentially assign a sequence(s) to any taxonomic level. Most commonly, a program will summarize the overall abundance of each taxonomic level it detects. If a species is detected and no other higher resolving sequence data are available, then strains within that species cannot be resolved based on that sequence data. Many programs may achieve assignment to the genus level, but not to species level attribution. Some programs conduct classification down to either genus or species, while other programs will assign to a variety of levels depending on the level of specificity of the input data. Programs designed to make assignment at the strain level for bacteria will need to be validated for that level of specificity as well as congruency with genus and species level summaries. Viral strain assignment poses additional challenges, as some viruses (for example, RNA viruses) can have high rates of mutation and form quasi-species for which no clear reference genomes are available [107]. Bacterial and virus level assignments are likely to improve as the number of sequenced microbial genomes continues to increase. Since phylogenetic assignments are based on extant data, the databases and software (and version) used to perform the phylogenetic analyses should be documented.

Software typically is run with thresholds for assignment likelihood that can be set at either the initiation of analysis or at the time of interpretation of output. The thresholds used for analysis should be defined and documented thoroughly. Documentation should include the step(s) at which thresholds are specified, either by user input, within configuration files, in output interpretation, or at any other step in the analytical process. Thresholds should not be assumed to be equivalent between programs or within different versions of the same program, as every step of the analysis can impact the odds or strength of assignment. While many thresholds for taxonomic assignment are set automatically, the user has a responsibility to design experiments that test the impact of thresholds on the output of known samples on taxonomic assignment and set those thresholds accordingly.

Abundance levels

The most basic measure of the abundance of an organism in a sample is binary abundance (that is, presence/absence). Proportional abundance provides information on a continuous scale, but usually does not accurately convey relative level of abundance. The current state-of-the-art is generally composed of abundance measures with no associated confidence values [97,122,127]. Because of the complex nature of metagenomic samples a stochastic threshold (or minimum abundance threshold) for detection should be implemented. The abundance threshold can be set empirically to where anything above that value is present and anything lower (below the limit of detection) is either inconclusive, not detected, or absent, but then should be used consistently to measure corresponding error rates. The degree of accuracy is tied to the threshold of detection that is set. Internal standards are useful. Most studies to date have collected metagenomic data in a relative framework, in which abundance of genes or messages is calculated as percent or proportion of the sample content. However, the abundance level can be more accurate if internal genomic DNA is added at the sample processing stage. If these control molecules are mixed into and processed alongside the sample-derived nucleic acids, more effective quantification and inter-sample comparisons may be performed. Internal controls also may provide information on the extent or directionality of changes in any particular gene or organisms present. For example, in tracking a particular source of a contamination, measuring a gradient pointing towards the source may be useful. When drawing a conclusion that the presence of a microorganism is, for example, inconclusive or absent, it should be stated as being below the limit of detection that is determined both by the amount of sequence data and the parameters at which the analysis program was benchmarked.
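The two ideas in this paragraph, a minimum abundance threshold and an internal standard, can be sketched as follows. The threshold value, read counts, and genome lengths below are hypothetical. The internal standard calculation assumes that reads are generated in proportion to genome copies multiplied by genome length, with equal recovery for target and standard, which would itself need to be demonstrated during validation.

```python
def call_presence(target_reads: int, total_reads: int, min_proportion: float) -> str:
    """Binary call from proportional abundance using an empirically set threshold."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    proportion = target_reads / total_reads
    return "present" if proportion >= min_proportion else "below limit of detection"


def estimate_target_copies(target_reads: int, target_genome_bp: int,
                           standard_reads: int, standard_genome_bp: int,
                           standard_copies_added: float) -> float:
    """Estimate target genome copies from a spiked-in internal standard."""
    if standard_reads <= 0:
        raise ValueError("internal standard was not detected")
    # Reads per base of genome approximate relative copy number.
    target_density = target_reads / target_genome_bp
    standard_density = standard_reads / standard_genome_bp
    return target_density / standard_density * standard_copies_added


# Hypothetical values for illustration.
print(call_presence(target_reads=50, total_reads=10_000, min_proportion=0.001))
print(estimate_target_copies(400, 4_000_000, 100, 2_000_000, 1_000))
```

In the second call, the target yields four times as many reads as the standard from a genome twice as long, so the estimate is about twice the 1,000 copies of standard that were added.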

Organism classification

Taxonomic classification of bacteria can sometimes create the misconception that microbial species are discrete and unique entities. Rather, some species are extremely closely related to each other and may form a continuum that is not readily resolved, while others are extremely distant from other microorganisms and can be categorized effectively [106]. Unfortunately, some separately named strains have almost identical genomes, while others are more dissimilar than some pairs of species. Therefore, when evaluating the power with which genomics can be used to distinguish between and among microorganisms and, thereby, define attribution under the circumstance of the analysis (for species to strain level identification or for determining similarity between two or more samples), it is important to understand the level of genomic similarity that they share (with known diversity of extant isolates). Also, the diversity in sequence within a species should be appreciated.

When constructing a test dataset for benchmarking, a decision first must be made regarding the level of genomic discrimination required. The level of genomic discrimination will likely be based on a list of known microorganisms of interest compared to their near neighbors. Whether that value is 5%, 1%, 0.1% or less, the microorganisms used for thresholding must have degrees of similarity consistent with that threshold. When calculating the similarity of two genomes, there are at least two methods that could be used: 1) calculating the similarity of regions and genes that are shared, or 2) normalizing that alignment value to the proportion of each genome that can be aligned. The second approach may account for plasmids or horizontally-transferred elements that may distinguish two strains of the same species. However, those strain-specific genes or regions may not provide any added discriminatory power to an algorithm depending on how it is constructed. One approach may be the percent identity of common (shared) genes or regions to characterize the similarity of different genomes, so that the relationship of strains with a high degree of similarity within the core genome is not confounded by the presence of mobile elements. The performance of an algorithm should be presented only in the context of the degree of similarity between the organisms used for validation, with probability estimate, if possible.
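The two similarity calculations described above can give noticeably different values for the same pair of genomes. The sketch below assumes that a pairwise alignment has already been summarized as a list of aligned blocks, each with an aligned length and a number of matching positions. The normalization shown (multiplying identity by the smaller of the two aligned fractions) is only one possible reading of the second method and is labeled here as an assumption; the block values and genome lengths are made up.

```python
def shared_region_identity(blocks) -> float:
    """Method 1: percent identity over shared (aligned) regions only."""
    aligned = sum(length for length, _ in blocks)
    matches = sum(matching for _, matching in blocks)
    if aligned == 0:
        raise ValueError("no aligned regions")
    return matches / aligned


def coverage_normalized_identity(blocks, genome_a_bp: int, genome_b_bp: int) -> float:
    """Method 2 (one assumed form): identity scaled by the aligned fraction of the genomes."""
    aligned = sum(length for length, _ in blocks)
    # Use the smaller aligned fraction so unshared content in either genome lowers the score.
    aligned_fraction = min(aligned / genome_a_bp, aligned / genome_b_bp)
    return shared_region_identity(blocks) * aligned_fraction


# Hypothetical pair: 1,000 bp aligned in total; genome B carries 250 bp of extra content.
blocks = [(900, 891), (100, 99)]
print(shared_region_identity(blocks))                    # identity of shared regions: 0.99
print(coverage_normalized_identity(blocks, 1000, 1250))  # lower, reflecting unshared content
```

Here the shared regions are 99% identical, but the normalized value drops to roughly 79% because a fifth of the larger genome has no counterpart, which is the kind of difference a plasmid or horizontally transferred element would produce.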

Another strategy for selecting microorganisms for benchmarking is to use specific microorganisms that are of particular interest. For example, discriminating between a threat agent (such as B. anthracis) and a close relative (such as B. cereus) may be a higher priority than discriminating between all known species that are differentiated by at least 1%. It is important to note that such a specific target approach cannot be applied to benchmarking studies of other microorganisms as they may not, and likely will not, have a comparable level of genomic dissimilarity. The documented goal(s) of the user will determine whether the validation is designed to assess global similarity measures or the similarity of specific target organisms to their near neighbors.

Community structure

In addition to containing many different microorganisms, whether the same ones or very different ones, metagenomic samples will differ dramatically according to the relative abundances of microorganisms comprising the sample. Abundances of each microorganism (or taxonomic level of resolution) will vary widely, so that performance will be judged across orders of magnitude. It is difficult to predict how the presence of one microorganism may modulate the detection of another (due to similar elements in those genomes and power of discrimination of the bioinformatic method). The relative abundances of each organism can be varied across a number of replicates if the method lacks discriminatory power. This evaluation is performed best in silico.

The output data from a series of validation tests should consist of a set of records containing:

1. Microorganism (or taxonomic level resolved).

2. Known abundance, for example, controls.

3. Measured abundance (either proportional or binary).

4. If possible, a confidence measure (or qualitative/quantitative statement).

Sets of independent tests and repetitive tests will allow for summary statistics to be applied for assessing attribution capabilities, as well as the performance of the analytical system as a whole. Since empirical data generation is demanding and costly, the use of simulation data is strongly recommended. Power testing also can be defined, based on the number of samples to be analyzed. Comparisons of abundance values of microbes in two or more samples may be used for potentially indicating association [137]. Relevant to such analyses may be population genetic quantities, such as alpha and beta diversities [138]. The appropriate criteria for abundance distributions and comparisons should be established during validation. Current software may perform such data analyses to a degree and it is anticipated that novel programs will become available.

Rates of FPs and FNs are important measures and correspond to the specificity and sensitivity of the assay, respectively (the FP rate is 1 - specificity and the FN rate is 1 - sensitivity). If a proportional abundance measure is given, abundance thresholds should be set to render an interpretation of presence/inconclusive/absence. If a confidence measure is given, a more stringent threshold can be used along that dimension as well. Threshold values are dependent on the parameters of the sequencing run, as well as the program used and reference database. A validation process that establishes confidence values for a particular set of output data will only be applicable to other samples that are processed on the same platform, using the same settings (read length, and so on), filtered and processed with the same Q-score cutoffs, and then analyzed with the same taxonomic assignment program run with identical settings. This restriction is extremely important because the results of the validation process cannot be extended directly to an analysis in which any of those parameters have been changed or do not match.
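The thresholding and error-rate bookkeeping described above can be sketched as follows. The two-threshold scheme, the function names, and the decision to exclude inconclusive calls from the rates are assumptions made for illustration; a laboratory would set its own rules during validation.

```python
def call_presence(measured_abundance, absent_below, present_at_or_above):
    """Map a proportional abundance to 'present', 'inconclusive' or 'absent'."""
    if absent_below > present_at_or_above:
        raise ValueError("absent_below must not exceed present_at_or_above")
    if measured_abundance >= present_at_or_above:
        return "present"
    if measured_abundance < absent_below:
        return "absent"
    return "inconclusive"


def error_rates(truth, calls):
    """Compare calls with known composition.

    truth: dict taxon -> bool (True if the taxon is in the control)
    calls: dict taxon -> 'present' | 'absent' | 'inconclusive'
    Inconclusive calls are counted separately and excluded from the rates.
    """
    tp = fp = tn = fn = inconclusive = 0
    for taxon, is_present in truth.items():
        call = calls[taxon]
        if call == "inconclusive":
            inconclusive += 1
        elif call == "present":
            if is_present:
                tp += 1
            else:
                fp += 1
        elif call == "absent":
            if is_present:
                fn += 1
            else:
                tn += 1
        else:
            raise ValueError(f"unknown call: {call}")
    # A rate is undefined (None) when its denominator is zero.
    fp_rate = fp / (fp + tn) if (fp + tn) else None
    fn_rate = fn / (fn + tp) if (fn + tp) else None
    return {
        "fp_rate": fp_rate,
        "fn_rate": fn_rate,
        "specificity": None if fp_rate is None else 1.0 - fp_rate,
        "sensitivity": None if fn_rate is None else 1.0 - fn_rate,
        "inconclusive": inconclusive,
    }
```

How inconclusive calls are counted changes the reported rates, so that choice would need to be fixed and documented before the validation data are examined.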

The accuracy of proportional abundance can be measured with a correlation coefficient, either parametric (for example, Pearson) or nonparametric (for example, Spearman). Pearson’s coefficient could indicate how closely the measured values track the known composition linearly, while Spearman’s coefficient could indicate how closely the measured rank-order of the organisms resembles the known composition. The utility of a program in determining the proportional abundance of individual microorganisms within a sample depends on the value of the correlation coefficient obtained for controls included in the analysis. However, for many forensic applications the relative abundance of an organism is far less important than the presence or absence of that organism, along with designation to the strain level of identification. Nevertheless, for applications in which relative abundance is to be reported with confidence, thorough validation must satisfy all requirements of a binary presence analysis, with the added dimension of the correlation coefficient.
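The two coefficients can be computed directly from paired known and measured abundances. The sketch below uses only the standard library and assigns average ranks to ties for Spearman; the function names and the example values are assumptions for illustration, not data from any study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    known = [0.50, 0.30, 0.15, 0.05]      # hypothetical control composition
    measured = [0.40, 0.35, 0.20, 0.05]   # hypothetical program output
    print(pearson(known, measured), spearman(known, measured))
```

In this example the rank order is preserved, so the Spearman coefficient is 1 even though the measured values deviate from the known ones and the Pearson coefficient is below 1.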

Standard operating protocols or procedures

All validated assays require SOPs, which must be based on validation results that encompass all appropriate aspects of the assay process, including but not limited to: sample collection and storage, nucleic acid extraction, enrichment, library preparation, sequencing, data analysis and interpretation of results. Developing SOPs for implementation of HTS as a diagnostic tool involves: (1) standardization; (2) optimization; (3) validation; and (4) automation [139]. These concepts, while initially developed for HTS-based microbial clinical diagnostics, apply equally to developing HTS SOPs for microbial forensics. Standardization, in this context, requires selecting a set of methods, software and workflows, along with setting thresholds for making a forensic interpretation based on features present in the HTS data set. SOPs themselves must be validated, ideally with blinded prospective studies using static data analysis workflows. Finally, data analysis and interpretation SOPs should be fully automated where possible, to reduce user-configurable parameters to a minimum [139].
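One way to keep a data analysis workflow static is to record the validated settings as an immutable object and refuse to interpret a run whose settings differ. The sketch below is a minimal illustration under that assumption; the class, its field names, and all example values are hypothetical and do not correspond to any particular platform or program.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WorkflowConfig:
    """Settings fixed at validation time (hypothetical field set)."""
    platform: str
    read_length: int
    min_q_score: int
    classifier: str
    classifier_version: str
    reference_database: str
    presence_threshold: float

    def fingerprint(self):
        """Stable SHA-256 digest of the settings, suitable for recording in an SOP."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mismatched_fields(validated, run):
    """Return the names of settings that differ between the two configurations."""
    v, r = asdict(validated), asdict(run)
    return sorted(name for name in v if v[name] != r[name])


if __name__ == "__main__":
    validated = WorkflowConfig("ExamplePlatform", 150, 20, "example-classifier",
                               "1.0.0", "example-db-2014-01", 0.01)
    run = WorkflowConfig("ExamplePlatform", 250, 20, "example-classifier",
                         "1.0.0", "example-db-2014-01", 0.01)
    differences = mismatched_fields(validated, run)
    if differences:
        print("Run is outside the validated configuration:", differences)
```

Storing the fingerprint alongside each result gives a simple audit trail showing which validated configuration, if any, a given analysis was run under.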

Conclusions

Conveying confidence in a test or process is essential in microbial forensics because the consequences are serious, and conclusions in the case of a biothreat event must be based on data and the resultant interpretations of evidence. Therefore, methods used in microbial forensics to generate results must be reliable and defensible, their limitations must be understood, and the process(es) of validation will contribute substantially in defining the confidence associated with an assay, method, or system. HTS is an invaluable tool, expanding the scope and power of microbial forensics to provide protection against and response to attacks with biological agents. The HTS process was described in some detail herein so that analysts who are not experienced researchers will have guidance on the features and criteria that should be addressed during a validation. An outline of the HTS validation criteria is provided in the list of elements below. The reader may consider such validation quite challenging. However, similar demands have been in place for forensic human identification, and the benefits to that forensic science community outweigh the burden of validation. It is difficult to lay out the highest priority or near-term goals here as these may vary with the test or application and, therefore, such decisions are left to the community of users. To accomplish a validation, the investigator should develop the criteria required for each situation. However, the criteria and the results and conclusions from validation studies must be available for inspection by appropriate parties.

List of elements to consider during validation of HTS for microbial forensics<sup>a</sup>

I. Sample Preparation

a. Template (DNA or RNA) quantity and quality

i. Minimum and maximum requirements

ii. Guidelines for action when these values fall out of range

b. Enrichment

i. Desired genomic regions for enrichment

ii. Limitations of the chosen method (for example, introduces known bias, increases error) and specific circumstances for its justified use

c. Library preparation

i. Quality, sensitivity, reproducibility and robustness of library preparation method(s) across expected sample types

d. Multiplexing

i. Performance of barcoding to identify specifically tagged samples

II. Sequencing

a. System features

i. Platform (if feasible, multiple orthogonal platforms)

ii. Chemistry

iii. Quality metrics

iv. Limitations

1. Error

2. Signal-intensity decay

3. Erroneous insertions/deletions

4. Strand bias

5. Potential for carry over contamination

III. Data analysis

a. Bioinformatics pipeline

i. Functions

ii. Quality metrics

1. Variant/sequence identification

2. Q score

3. Coverage

4. Error

5. Allele call (SNP state, indel state, and so on)

6. Threshold

7. False positive and false negative rates

iii. Reference standard

1. Variant calling

2. Gene or functional element assignment

iv. Alignment- or composition-based software

1. Functions

2. Rules for alignment

v. Phylogenetics software

1. Functions

b. Bioinformatics software management

IV. Controls

a. Level of acceptable characterization

b. Intended use

V. Reference materials

VI. Databases

VII. Interpretation

a. Sample type

i. Single source

ii. Complex or metagenomic

1. Abundance

iii. FP and FN rates

b. Attribution

i. Taxonomic assignment

ii. Association

iii. Reverse engineering

1. Sample preparation

2. Genetic engineering

iv. FP and FN rates

c. Quantitative/Qualitative statements

i. Confidence

VIII. SOPs

a. Sample preparation

i. Extraction

ii. Enrichment

iii. Reverse transcription (if necessary)

b. Library preparation

c. Sequencing

d. Bioinformatics pipeline

i. Data analysis

ii. Data storage

iii. Data transfer

iv. Interpretation

<sup>a</sup>It is not possible to generate an all-inclusive element list because of the wide diversity of samples, sample types, chemistries, platforms, and bioinformatics for which HTS methods may be applied. Therefore, this outline serves as a guideline, rather than an exhaustive or prescriptive regulation. The user should evaluate these elements, select those that apply, justify why some elements were not applied, and add any elements that are method specific and not included in this outline.

The HTS validation process should, at a minimum: 1) ensure that appropriate reference and benchmarking datasets are used to establish FP and FN values within a rigorous statistical framework; and 2) require the practices, chemistries, settings, and bioinformatics programs used to generate actionable results be thoroughly documented and standardized, at least within the specific laboratory employing the method(s). It is important to remember that identification to species and strain is highly dependent on phylogenetic similarity of near neighbors used for comparison. Consequently, the validation of a process to detect a given species or strain cannot be applied indiscriminately to additional target organisms without additional validation. The ability of a process to identify to species level varies across the tree of life, and validation processes must take the phylogenetic framework into consideration.
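As one example of placing FP and FN values in a statistical framework, an observed error rate can be reported with a confidence interval rather than as a point estimate. The sketch below uses the Wilson score interval, which is chosen here for illustration and is not prescribed by the source; the function name and the default z value (1.96, for an approximately 95% interval) are assumptions.

```python
import math


def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for an error rate estimated as errors / trials."""
    if trials <= 0 or not 0 <= errors <= trials:
        raise ValueError("need trials > 0 and 0 <= errors <= trials")
    p = errors / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials
                               + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical outcome: no false positives observed in 100 negative controls.
    low, high = wilson_interval(0, 100)
    print(f"FP rate interval: {low:.4f} to {high:.4f}")
```

The example makes the point that observing no false positives in a limited number of controls does not establish a false positive rate of zero; the upper bound shrinks only as the number of tests grows.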

The validation process described herein allows for translation of research tools to forensic applications so that HTS can provide the reproducibility and specificity necessary to stand up to the full weight of legal scrutiny. In addition to validation, the laboratory is urged to adopt an overall quality management and quality assurance system to provide a working foundation essential for microbial forensics, maintaining good laboratory practices and bolstering confidence in results.

As HTS technologies continue to advance, costs will continue to drop, instruments will become faster, smaller and eventually portable, and their applications will continue to increase. Advanced sequencing technologies will begin to be applied to measuring DNA modifications and epigenetic factors, offering yet another layer of specificity. With longer reads, genetically engineered organisms will be detected, most likely by identification of unexpected adjacency of genomic elements. The validation criteria described in this paper will likely apply to the new analytical developments of the coming years and, therefore, provide a stable foundation for future implementation in microbial forensics. If methods are validated within the framework outlined here, microbial forensics will achieve an ever higher level of power and analytical value and, ultimately, greater protection for the public and the nation’s safety and security.

Abbreviations

bp: base pair; FN: false negative; FP: false positive; HTS: high throughput sequencing; IEEE: Institute of Electrical and Electronics Engineers; PCR: polymerase chain reaction; QA: quality assurance; RMs: reference materials; SMRT: single molecule real time; SNPs: single nucleotide polymorphisms; SOPs: standard operating protocols; SWGMGF: Scientific Working Group for Microbial Genetics and Forensics.

Competing interests

The authors declare that they have no competing interests. Ethics: No IRB approval was required for this work.

Authors’ contributions

BB and SM conceived and developed the topic and wrote the majority of the manuscript; NC, AB, RC, CC, JF, MF, DK, AM, SAM, RM, AS, SS, KT and ST provided input and review. All authors read and approved the final manuscript.

Financial disclosure

The affiliated primary institutions and agencies supported this study by providing salaries for the authors. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Bush

Abrams

Beall

Johnson

Index case of fatal inhalational anthrax due to bioterrorism in the United States

N Engl J Med200134516071610

11704685

Traeger

Wiersma

Rosenstein

Malecki

Shepard

Raghunathan

Pillai

Popovic

Quinn

Meyer

Zaki

Kumar

Bruce

Sejvar

Dull

Tierney

Jones

Perkins

Team

First case of bioterrorism-related inhalational anthrax in the United States, Palm Beach County, Florida, 2001

Emerg Infect Dis2002810291034

12396910

Jernigan

Stephens

Ashford

Omenaca

Topiel

Galbraith

Tapper

Fisk

Zaki

Popovic

Meyer

Quinn

Harper

Fridkin

Sejvar

Shepard

McConnell

Guarner

Shieh

Malecki

Gerberding

Hughes

Perkins

Anthrax Bioterrorism Investigation Team: Bioterrorism-related inhalational anthrax: the first 10 cases reported in the United States

Emerg Infect Dis20017933944

11747719

Hsu

Lukacs

Handzel

Hayslett

Harper

Hales

Semenova

Romero-Steiner

Elie

Quinn

Khabbaz

Khan

Martin

Eisold

Schuchat

Hajjeh

Opening a Bacillus anthracis-containing envelope, Capitol Hill, Washington, D.C.: the public health response

Emerg Infect Dis2002810391043

12396912

Murch

Forensic perspective on bioterrorism and bioproliferation

Firepower in the Laboratory. Proceedings of the Symposium on Research Needs for Laboratory Automation and Bioterrorism2001

Washington DC: National Academy of Sciences Press

203214

Murch

Microbial forensics: building a national capacity to investigate bioterrorism

Biosecur Bioterror20031117122

15040189

Budowle

Schutzer

Einseln

Kelley

Walsh

Smith

Marrone

Robertson

Campos

Public health. Building microbial forensics as a response to bioterrorism

Science200330118521853

Morse

Budowle

Microbial forensics: application to bioterrorism preparedness and response

Infect Dis Clin North Am200620455473

16762747

Flowers

Mothershead

Blackwell

Bioterrorism preparedness. II: the community and emergency medical services systems

Emerg Med Clin North Am200220457476

12120487

Morse

Kellogg

Perry

Meyer

Bray

Nichelson

Miller

Detecting biothreat agents: the Laboratory Response Network

ASM News200369433437

Fletcher

Gullino M, Fletcher J, Gamliel A, Stacks J

The need for forensic tools in a balanced national agricultural security program

Crop Biosecurity: Assuring Our Global Food Supply2008

Springer Science + Business Media B.V

93101

Fletcher

Barnaby

Burans

Melcher

Ochoa Corona

Budowle B, Schutzer SE, Breeze RG, Keim PS, Morse SA

Forensic plant pathology

Microbial Forensics2010

Elsevier Inc

89105

Fletcher

Bender

Budowle

Cobb

Gold

Ishimaru

Luster

Melcher

Murch

Scherm

Seem

Sherwood

Sobral

Tolin

Plant pathogen forensics: capabilities, needs, and recommendations

Microbiol Mol Biol Rev200670450471

16760310

Harmon

Budowle B, Schutzer SE, Breeze RG

Admissibility standards for scientific evidence

Microbial Forensics2005

Academic Press

382392

Fleischmann

Adams

White

Clayton

Kirkness

Kerlavage

Bult

Tomb

Dougherty

Merrick

McKenney

Sutton

FitzHugh

Fields

Gocayne

Scott

Shirley

Liu

Glodek

Kelley

Weidman

Phillips

Spriggs

Hedblom

Cotton

Utterback

Hanna

Nguyen

Saudek

Brandon

Whole-genome random sequencing and assembly of Haemophilus influenzae Rd

Science1995269496512

7542800

Fraser

Gocayne

White

Adams

Clayton

Fleischmann

Bult

Kerlavage

Sutton

Kelley

Fritchman

Weidman

Small

Sandusky

Fuhrmann

Nguyen

Utterback

Saudek

Phillips

Merrick

Tomb

Dougherty

Bott

Lucier

Peterson

Smith

Hutchison 3rd

Venter

The minimal gene complement of Mycoplasma genitalium

Science1995270397403

7569993

Venter

Adams

Myers

Mural

Sutton

Smith

Yandell

Evans

Holt

Gocayne

Amanatides

Ballew

Huson

Wortman

Zhang

Kodira

Zheng

Chen

Skupski

Subramanian

Thomas

Zhang

Gabor Miklos

Nelson

Broder

Clark

Nadeau

McKusick

Zinder

The sequence of the human genome

Science200129113041351

11181995

Lander

Linton

Birren

Nusbaum

Zody

Baldwin

Devon

Dewar

Doyle

FitzHugh

Funke

Gage

Harris

Heaford

Howland

Kann

Lehoczky

LeVine

McEwan

McKernan

Meldrim

Mesirov

Miranda

Morris

Naylor

Raymond

Rosetti

Santos

Sheridan

Sougnez

Initial sequencing and analysis of the human genome

Nature2001409860921

11237011

Wheeler

Srinivasan

Egholm

Shen

Chen

McGuire

Chen

Makhijani

Roth

Gomes

Tartaro

Niazi

Turcotte

Irzyk

Lupski

Chinault

Song

Liu

Yuan

Nazareth

Qin

Muzny

Margulies

Weinstock

Gibbs

Rothberg

The complete genome of an individual by massively parallel DNA sequencing

Nature2008452872876

18421352

Brzuszkiewicz

Thürmer

Schuldes

Leimbach

Liesegang

Meyer

Boelter

Petersen

Gottschalk

Daniel

Genome sequence analyses of two isolates from the recent Escherichia coli outbreak in Germany reveal the emergence of a new pathotype: Entero-Aggregative-Haemorrhagic Escherichia coli (EAHEC)

Arch Microbiol2011193883891

21713444

Chin

Sorenson

Harris

Robins

Charles

Jean-Charles

Bullard

Webster

Kasarskis

Peluso

Paxinos

Yamaichi

Calderwood

Mekalanos

Schadt

Waldor

The origin of the Haitian cholera outbreak strain

N Engl J Med20113643342

21142692

Cummings

Bormann Chung

Fang

Barker

Brzoska

Williamson

Beaudry

Matthews

Schupp

Wagner

Birdsell

Vogler

Furtado

Keim

Budowle

Accurate, rapid and high-throughput detection of strain-specific polymorphisms in Bacillus anthracis and Yersinia pestis by next-generation sequencing

Investig Genet201015

Grad

Lipsitch

Feldgarden

Arachchi

Cerqueira

Fitzgerald

Godfrey

Haas

Murphy

Russ

Sykes

Walker

Wortman

Young

Zeng

Abouelleil

Bochicchio

Chauvin

Desmet

Gujja

McCowan

Montmayeur

Steelman

Frimodt-Møller

Petersen

Struve

Krogfelt

Bingen

Weill

Lander

Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011

Proc Natl Acad Sci U S A201210930653070

22315421

Eisen

Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes

PLoS Biol20075e82

17355177

Hasan

Choi

Eppinger

Clark

Chen

Alam

Haley

Taviani

Hine

Tallon

Prosper

Furth

Hoq

Fraser-Liggett

Cravioto

Huq

Ravel

Cebula

Colwell

Genomic diversity of 2010 Haitian cholera outbreak strains

Proc Natl Acad Sci U S A2012109E2010E2017

22711841

Hendriksen

Price

Schupp

Gillece

Kaas

Engelthaler

Bortolaia

Pearson

Waters

Upadhyay

Shrestha

Adhikari

Shakya

Keim

Aarestrup

Population genetics of Vibrio cholerae from Nepal in 2010: evidence on the origin of the Haitian outbreak

MBio20112e0015711

21862630

Holt

Parkhill

Mazzoni

Roumagnac

Weill

Goodhead

Rance

Baker

Maskell

Wain

Dolecek

Achtman

Dougan

High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi

Nat Genet200840987993

18660809

Hornstra

Priestley

Georgia

Kachur

Birdsell

Hilsabeck

Gates

Samuel

Heinzen

Kersh

Keim

Massung

Pearson

Rapid typing of Coxiella burnetii

PLoS One20116e26201

22073151

Howden

McEvoy

Allen

Chua

Gao

Harrison

Bell

Coombs

Bennett-Wood

Porter

Robins-Browne

Davies

Seemann

Stinear

Evolution of multidrug resistance during Staphylococcus aureus infection involves mutation of the essential two component regulator WalKR

PLoS Pathog20117e1002359

22102812

Köser

Ellington

Cartwright

Gillespie

Brown

Farrington

Holden

Dougan

Bentley

Parkhill

Peacock

Routine use of microbial whole genome sequencing in diagnostic and public health microbiology

PLoS Pathog20128e1002824

22876174

MacLean

Jones

Studholme

Application of “next-generation” sequencing technologies to microbial genetics

Nat Rev Microbiol20097287296

19287448

Mellmann

Harmsen

Cummings

Zentz

Leopold

Rico

Prior

Szczepanowski

Zhang

McLaughlin

Henkhaus

Leopold

Bielaszewska

Prager

Brzoska

Moore

Guenther

Rothberg

Karch

Prospective genomic characterization of the German enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology

PLoS One20116e22751

21799941

Rasko

Webster

Sahl

Bashir

Boisen

Scheutz

Paxinos

Sebra

Chin

Iliopoulos

Klammer

Peluso

Lee

Kislyuk

Bullard

Kasarskis

Wang

Eid

Rank

Redman

Steyert

Frimodt-Møller

Struve

Petersen

Krogfelt

Nataro

Schadt

Waldor

Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany

N Engl J Med2011365709717

21793740

Rohde

Qin

Cui

Loman

Hentschke

Chen

Peng

Zhang

Yang

Zhao

Wang

Guan

Cen

Zhao

Christner

Kobbe

Loos

Yang

Danchin

Gao

Song

Yang

Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4

N Engl J Med2011365718724

21793736

Gardy

Johnston

Ho Sui

Cook

Shah

Brodkin

Rempel

Moore

Zhao

Holt

Varhol

Birol

Lem

Sharma

Elwood

Jones

Brinkman

Brunham

Tang

Whole-genome sequencing and social-network analysis of a tuberculosis outbreak

N Engl J Med2011364730739

21345102

Harris

Feil

Holden

Quail

Nickerson

Chantratita

Gardete

Tavares

Day

Lindsay

Edgeworth

de Lencastre

Parkhill

Peacock

Bentley

Evolution of MRSA during hospital transmission and intercontinental spread

Science2010327469474

20093474

Harris

Cartwright

Török

Holden

Brown

Ogilvy-Stuart

Ellington

Quail

Bentley

Parkhill

Peacock

Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study

Lancet Infect Dis201313130136

23158674

Roetzer

Diel

Kohl

Rückert

Nübel

Blom

Wirth

Jaenicke

Schuback

Rüsch-Gerdes

Supply

Kalinowski

Niemann

Whole genome sequencing versus traditional genotyping for investigation of a Mycobacterium tuberculosis outbreak: a longitudinal molecular epidemiological study

PLoS Med201310e1001387

23424287

Walker

Harrell

Evans

Kapatai

Dedicoat

Eyre

Wilson

Hawkey

Crook

Parkhill

Harris

Walker

Bowden

Monk

Smith

Peto

Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study

Lancet Infect Dis201313137146

23158499

Loman

Constantinidou

Christner

Rohde

Chan

Quick

Weir

Quince

Smith

Betley

Aepfelbacher

Pallen

A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4

JAMA201330915021510

23571589

Stobbe

Daniels

Espindola

Verma

Melcher

Ochoa-Corona

Garzon

Fletcher

Schneider

E-probe Diagnostic Nucleic acid Analysis (EDNA): a theoretical approach for handling of next generation sequencing data for diagnostics

J Microbiol Methods201394356366

23867249

Stobbe

Schneider

Hoyt

Melcher

Screening metagenomic data for viruses using the e-probe diagnostic nucleic acid assay (EDNA)

Phytopathologyin press

Breeze

Budowle

Schutzer

(Eds): Microbial Forensics2005

Amsterdam: Academic Press

Budowle

Schutzer

Breeze

Keim

Morse

(Eds): Microbial Forensics20112

Amsterdam: Academic Press

Budowle

Schmedes

Murch

The microbial forensics pathway for use of massively-parallel sequencing technologies

The Science and Applications of Microbial Genomics2013

Washington DC: The National Academies Press

117133

Sanger

Nicklen

Coulson

DNA sequencing with chain-terminating inhibitors

Proc Natl Acad Sci U S A19777454635467

271968

Wetterstrand

DNA sequencing costs: data from the NHGRI Large-Scale Genome Sequencing Program2013Available at: https://www.genome.gov/sequencingcosts/

Brenner

Williams

Vermaas

Storck

Moon

McCollum

Mao

Luo

Kirchner

Eletr

DuBridge

Burcham

Albrecht

In vitro cloning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs

Proc Natl Acad Sci U S A20009716651670

Margulies

Egholm

Altman

Attiya

Bader

Bemben

Berka

Braverman

Chen

Dewell

Fierro

Gomes

Godwin

Helgesen

Irzyk

Jando

Alenquer

Jarvie

Jirage

Kim

Knight

Lanza

Leamon

Lefkowitz

Lei

Genome sequencing in microfabricated high-density picolitre reactors

Nature2005437376380

16056220

Merriman

Rothberg

Progress in ion torrent semiconductor chip based sequencing

Electrophoresis20123333973417

23208921

Metzker

Sequencing technologies - the next generation

Nat Rev Genet2010113146

19997069

Quail

Kozarewa

Smith

Scally

Stephens

Durbin

Swerdlow

Turner

A large genome center’s improvements to the Illumina sequencing system

Nat Methods2008510051010

19034268

Shendure

Next-generation DNA sequencing

Nat Biotechnol20082611351145

18846087

Roberts

Carneiro

Schatz

The advantages of SMRT sequencing

Genome Biol201314405

23822731

Flusberg

Webster

Lee

Travers

Olivares

Clark

Korlach

Turner

Direct detection of DNA methylation during single-molecule, real-time sequencing

Nat Methods20107461465

20453866

Eisenstein

Oxford Nanopore announcement sets sequencing sector abuzz

Nat Biotechnol201230295296

22491260

Didelot

Bowden

Wilson

Peto

Crook

Transforming clinical microbiology with bacterial genome sequencing

Nat Rev Genet201213601612

22868263

Budowle

Schutzer

Morse

Martinez

Chakraborty

Marrone

Messenger

Murch

Jackson

Williamson

Harmon

Velsko

Criteria for validation of methods in microbial forensics

Appl Environ Microbiol20087455995607

18658281

Rehm

Bale

Bayrak-Toydemir

Berg

Brown

Deignan

Friez

Funke

Hegde

Lyon

Working Group of the American College of Medical Genetics and Genomics Laboratory Quality Assurance Committee: ACMG clinical laboratory standards for next-generation sequencing

Genet Med201315733747

23887774

Gargis

Kalman

Berry

Bick

Dimmock

Hambuch

Lyon

Voelkerding

Zehnbauer

Agarwala

Bennett

Chen

Chin

Compton

Das

Farkas

Ferber

Funke

Furtado

Ganova-Raeva

Geigenmüller

Gunselman

Hegde

Johnson

Kasarskis

Kulkarni

Lenk

Liu

Manion

Assuring the quality of next-generation sequencing in clinical laboratory practice

Nat Biotechnol20123010331036

23138292

Pont-Kingdon

Gedge

Wooderchak-Donahue

Schrijver

Weck

Kant

Oglesbee

Bayrak-Toydemir

Lyon

Biochemical and Molecular Genetic Resource Committee of the College of American Pathologists: Design and analytical validation of clinical DNA sequencing assays

Arch Pathol Lab Med20121364146

22208486

SWGDAM

Validation guidelines for DNA analysis methods

2012Available at: http://swgdam.org/SWGDAM_Validation_Guidelines_APPROVED_Dec_2012.pdf

Lewis

Hoffmann

Chen

Knight

Bittinger

Hwang

Chen

Berkowsky

Nessel

Bushman

Sampling and pyrosequencing methods for characterizing bacterial communities in the human gut using 16S sequence tags

BMC Microbiol201010206

20673359

Kuczynski

Lauber

Walters

Parfrey

Clemente

Gevers

Knight

Experimental and analytical tools for studying the human microbiome

Nat Rev Genet2012134758

22179717

Daniel

The metagenomics of soil

Nat Rev Microbiol20053470478

15931165

DeLong

Microbial community genomics in the ocean

Nat Rev Microbiol20053459469

15886695

Rusch

Halpern

Sutton

Heidelberg

Williamson

Yooseph

Eisen

Hoffman

Remington

Beeson

Tran

Smith

Baden-Tillson

Stewart

Thorpe

Freeman

Andrews-Pfannkoch

Venter

Kravitz

Heidelberg

Utterback

Rogers

Falcón

Souza

Bonilla-Rosso

Eguiarte

Karl

Sathyendranath

The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific

PLoS Biol20075e77

17355176

Kembel

Jones

Kline

Northcutt

Stenson

Womack

Bohannan

Brown

Green

Architectural design influences the diversity and structure of the built environment microbiome

ISME J2012614691479

22278670

Human Microbiome Project Consortium

Structure, function and diversity of the healthy human microbiome

Nature2012486207214

22699609

Human Microbiome Project Consortium

A framework for human microbiome research

Nature2012486215221

22699610

Thomas

Gilbert

Meyer

Metagenomics - a guide from sampling to data analysis

Microb Inform Exp201223

22587947

Keim

Price

Klevytska

Smith

Schupp

Okinaka

Jackson

Hugh-Jones

Multiple-locus variable-number tandem repeat analysis reveals genetic relationships within Bacillus anthracis

J Bacteriol200018229282936

10781564

Hoffmaster

Fitzgerald

Ribot

Mayer

Popovic

Molecular subtyping of Bacillus anthracis and the 2001 bioterrorism-associated anthrax outbreak, United States

Emerg Infect Dis2002811111116

12396925

Schutzer

Keim

Czerwinski

Budowle

Use of forensic methods under exigent circumstances without full validation

Sci Transl Med200918cm7

Budowle

Schutzer

Burans

Beecher

Cebula

Chakraborty

Cobb

Fletcher

Hale

Harris

Heitkamp

Keller

Kuske

Leclerc

Marrone

McKenna

Morse

Rodriguez

Valentine

Yadev

Quality sample collection, handling, and preservation for an effective microbial forensics program

Appl Environ Microbiol20067264316438

17021190

Ellard

Charlton

Lindsay

Camm

Watson

Abb

Mattocks C2012

Practice Guidelines for Targeted Next Generation Sequencing Analysis and Interpretation. Clinical Molecular Genetics Society: Taylor GR

Available at: http://cmgsweb.shared.hosting.zen.co.uk/BPGs/BPG%20for%20targeted%20next%20generation%20sequencing%20final.pdf

Illumina

PCR-free sample preparation kits for whole genome DNA sequencing

2013Available at: http://www.illumina.com/products/truseq-dna-pcr-free-sample-prep-kits.ilmn

Agilent

HaloPlex target enrichment system-ILM

2013Available at: http://www.chem.agilent.com/Library/usermanuals/Public/G9900-90001.pdf

Illumina

Nextera XT DNA sample preparation kit

2013Available at: http://www.illumina.com/products/nextera_xt_dna_sample_prep_kit.ilmn

Thurber

Haynes

Breitbart

Wegley

Rohwer

Laboratory procedures to generate viral metagenomes

Nat Protoc20094470483

19300441

Feehery

Yigit

Oyola

Langhorst

Schmidt

Stewart

Dimalanta

Amaral-Zettler

Davis

Quail

Pradhan

A method for selectively enriching microbial DNA from contaminating vertebrate host DNA

PLoS One20138e76096

24204593

Telenius

Carter

Bebb

Nordenskjöld

Ponder

Tunnacliffe

Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer

Genomics199213718725

1639399

Cheung

Nelson

Whole genome amplification using a degenerate oligonucleotide primer allows hundreds of genotypes to be performed on less than one nanogram of genomic DNA

Proc Natl Acad Sci U S A1996931467614679

8962113

Dean

Hosono

Fang

Faruqi

Bray-Ward

Sun

Zong

Driscoll

Song

Kingsmore

Egholm

Lasken

Comprehensive human genome amplification using multiple displacement amplification

Proc Natl Acad Sci U S A20029952615266

11959976

Syed

Gruenwald

Caruccio

Next-generation sequencing library preparation: simultaneous fragmentation and tagging using in vitro transposition

Nat Methods2009Available at: http://www.nature.com/nmeth/journal/v6/n11/full/nmeth.f.272.html

Knapp

Stiller

Meyer

Generating barcoded libraries for multiplex high-throughput sequencing

Methods Mol Biol2012840155170

22237533

Hamady

Walker

Harris

Gold

Knight

Error-correcting barcoded primers allow hundreds of samples to be pyrosequenced in multiplex

Nat Methods20125235237

18264105

Caporaso

Lauber

Walters

Berg-Lyons

Huntley

Fierer

Owens

Betley

Fraser

Bauer

Gormley

Gilbert

Smith

Knight

Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms

ISME J2012616211624

22402401

Berglund

Kiialainen

Syvänen

Next-generation sequencing technologies and applications for human genetic history and forensics

Investig Genet2011223

Jünemann

Sedlazeck

Prior

Albersmeier

John

Kalinowski

Mellmann

Goesmann

von Haeseler

Stoye

Harmsen

Updating benchtop sequencing performance comparison

Nat Biotechnol201331294296

23563421

Loman

Constantinidou

Chan

Halachev

Sergeant

Penn

Robinson

Pallen

High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity

Nat Rev Microbiol201210599606

22864262

Quail

Smith

Coupland

Otto

Harris

Connor

Bertoni

Swerdlow

A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers

BMC Genomics201213341

Lam

Clark

Chen

Natsoulis

O’Huallachain

Dewey

Habegger

Ashley

Gerstein

Butte

Snyder

Performance comparison of whole-genome sequencing platforms

Nat Biotechnol2011307882

22178993

Harismendy

Strausberg

Wang

Stockwell

Beeson

Schork

Murray

Topol

Levy

Frazer

Evaluation of next generation sequencing platforms for population targeted sequencing studies

Genome Biol200910R32

19327155

Ewing

Green

Base-calling of automated sequencer traces using phred. II. Error probabilities

Genome Res19988186194

Hasan

Young

Minard-Smith

McMillan

Isom

Abdullah

Bornman

Faith

Choi

Longmire

Dickens

Cebula

Colwell

Microbial community profiling of human saliva using shotgun metagenomic sequencing

PLoS Onein press

Ames

Hysom

Gardner

Lloyd

Gokhale

Allen

Scalable metagenomic taxonomy classification using a reference genome database

Bioinformatics20132922532260

23828782

Minot

Turner

Ternus

Kadavy

SIANN: Strain identification by alignment to near neighbors2014Available at: http://biorxiv.org/lookup/doi/10.1101/001727

Wood

Salzberg

Kraken: ultrafast metagenomic sequence classification using exact alignments

Genome Biol201415R46

24580807

Keim

Kalif

Schupp

Hill

Travis

Richmond

Adair

Hugh-Jones

Kuske

Jackson

Molecular evolution and diversity in Bacillus anthracis as detected by amplified fragment length polymorphism markers

J Bacteriol1997179818824

9006038

Nocq

Celton

Gendron

Lemieux

Wilhelm

Harnessing virtual machines to simplify next-generation DNA sequencing analysis

Bioinformatics20132920752083

23786767

Board IS

IEEE Standard 610.12 Glossary of software engineering terminology

1990Available at: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?reload=true&punumber=2238

Crumpler

Cheng

Tillman

Benesch

Sawyer

Murray

Press

Snipes

Godziemski

Bergeson

Loreng

General principles of software validation; Final guidance for industry and FDA staff2002Available at: http://www.fda.gov/medicaldevices/deviceregulationandguidance/guidancedocuments/ucm085281.htm

Deorowicz

Grabowski

Compression of DNA sequence reads in FASTQ format

Bioinformatics201127860862

21252073

Bonfield

Mahoney

Compression of FASTQ and SAM format sequencing data

PLoS One20138e59190

23533605

Buckley

Roberts

Reconciling Microbial Systematics and Genomics2007

Washington DC: American Academy of Microbiology

Available at: http://academy.asm.org/index.php/genetics-genomics-molecular-microbiology/454-reconciling-microbial-systematics-and-genomics

Gonzalez-Candelas

Bracho

Wrobel

Moya

Molecular evolution in court analysis of a large hepatitis C virus outbreak from an evolving source

BMC Biol20131176

23870105

Treangen

Koren

Sommer

Liu

Astrovskaya

Ondov

Darling

Phillippy

Pop

MetAMOS: a modular and open source metagenomic assemblyand analysis pipeline

Genome Biol201314R2

23320958

Zerbino

Birney

Velvet: algorithms for de novo short read assembly using de Bruijn graphs

Genome Res200818821829

18349386

Narzisi

Mishra

Comparing de novo genome assembly: the long and short of it

PLoS One20116e19175

21559467

Hernandez

François

Farinelli

Osterås

Schrenzel

De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

Genome Res200818802809

18332092

Chaisson

Pevzner

Short read fragment assembly of bacterial genomes

Genome Res200818324330

18083777

Bradnam

Fass

Alexandrov

Baranay

Bechner

Birol

Boisvert

Chapman

Chapuis

Chikhi

Chitsaz

Chou

Corbeil

Del Fabbro

Docking

Durbin

Earl

Emrich

Fedotov

Fonseca

Ganapathy

Gibbs

Gnerre

Godzaridis

Goldstein

Haimel

Hall

Haussler

Hiatt

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Gigascience2013210

23870653

Gurevich

Saveliev

Vyahhi

Tesler

QUAST: quality assessment tool for genome assemblies

Bioinformatics20132910721075

23422339

Hatem

Bozdağ

Toland

Çatalyürek

ÜV

Benchmarking short sequence mapping tools

BMC Bioinformatics201314184

23758764

Langmead

Salzberg

Fast gapped-read alignment with Bowtie 2

Nat Methods20129357359

22388286

Delcher

Phillippy

Carlton

Salzberg

Fast algorithms for large-scale genome alignment and comparison

Nucleic Acids Res20023024782483

12034836

McHardy

Martín

Tsirigos

Hugenholtz

Rigoutsos

Accurate phylogenetic classification of variable-length DNA fragments

Nat Methods200746372

17179938

Patil

Roune

McHardy

The PhyloPythiaS web server for taxonomic assignment of metagenome sequences

PLoS One20127e38581

22745671

Brady

Salzberg

Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models

Nat Methods20096673676

19648916

Rosen

Reichenberger

Rosenfeld

NBC: the Naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads

Bioinformatics201127127129

21062764

Berendzen

Bruno

Cohn

Hengartner

Kuske

McMahon

Wolinsky

Xie

Rapid phylogenetic and functional classification of short genomic fragments with signature peptides

BMC Res Notes20125460

22925230

Huson

Auch

Schuster

MEGAN analysis of metagenomic data

Genome Res200717377386

17255551

Davenport

Neugebauer

Beckmann

Friedrich

Kameri

Kokott

Paetow

Siekmann

Wieding-Drewes

Wienhöfer

Wolf

Tümmler

Ahlers

Sprengel

Genometa–a fast and accurate classifier for short metagenomic shotgun reads

PLoS One20127e41224

22927906

Sharma

Kumar

Prakash

Taylor

Fast and accurate taxonomic assignments of metagenomic sequences using MetaBin

PLoS One20127e34030

22496776

Liu

Gibbons

Ghodsi

Treangen

Pop

Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences

BMC Genomics201112Suppl 2S4

21989143

Segata

Waldron

Ballarini

Narasimhan

Jousson

Huttenhower

Metagenomic microbial community profiling using unique clade-specific marker genes

Nat Methods20129811814

22688413

Haft

Tovchigrechko

High-speed microbial community profiling

Nat Methods20129793794

22688412

Krause

Diaz

Goesmann

Kelley

Nattkemper

Rohwer

Edwards

Stoye

Phylogenetic classification of short environmental DNA fragments

Nucleic Acids Res20083622302239

18285365

Gerlach

Jünemann

Tille

Goesmann

Stoye

WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads

BMC Bioinformatics200910430

20021646

Markowitz

Chen

Chu

Szeto

Palaniappan

Grechkin

Ratner

Jacob

Pati

Huntemann

Liolios

Pagani

Anderson

Mavromatis

Ivanova

Kyrpides

IMG/M: the integrated metagenome data management and comparative analysis system

Nucleic Acids Res201240D123D129

22086953

Meyer

Paarmann

D’Souza

Olson

Glass

Kubal

Paczian

Rodriguez

Stevens

Wilke

Wilkening

Edwards

The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes

BMC Bioinformatics20089386

18803844

Wang

Leung

Yiu

Chin

MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species

J Comput Biol201219241249

22300323

MacDonald

Parks

Beiko

Rapid identification of high-confidence taxonomic assignments for metagenomic data

Nucleic Acids Res201240e111

22532608

Brady

Salzberg

PhymmBL expanded: confidence scores, custom databases, parallelization and more

Nat Methods20118367

21527926

Bazinet

Cummings

A comparative evaluation of sequence classification programs

BMC Bioinformatics201213113

22214541

Fierer

Lauber

Zhou

Mcdonald

Costello

Knight

Forensic identification using skin bacterial communities

Proc Natl Acad Sci U S A201010764776481

20231444

Whittaker

Evolution and measurement of species diversity

Taxon197221213251

Fricke

Rasko

Bacterial genome sequencing in the clinic: bioinformatic challenges and solutions

Nat Rev Genet2013154955

24281148