
^{a}Department of Biostatistics and Epidemiology, University of Massachusetts-Amherst, Amherst, MA;

^{b}Computer Science Department, Carnegie Mellon University, Pittsburgh, PA;

^{c}Department of Integrative Biology, University of Texas at Austin, Austin, TX;

^{d}Department of Environmental Health Sciences, Columbia University, New York, NY;

^{e}Influenza Division, Centers for Disease Control and Prevention, Atlanta, GA;

^{f}Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, NM;

^{g}Department of Mathematics and Statistics, Mount Holyoke College, South Hadley, MA;

^{h}Division of Vector-Borne Diseases, Centers for Disease Control and Prevention, San Juan, PR 00920;

^{i}Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA

^{1}To whom correspondence should be addressed. Email:

Edited by Sebastian Funk, London School of Hygiene & Tropical Medicine, London, United Kingdom, and accepted by Editorial Board Member Diane E. Griffin December 10, 2018 (received for review July 24, 2018)

Author contributions: N.G.R., L.C.B., S.J.F., S.K., C.J.M., E.M., D.O., E.L.R., A.T., T.K.Y., M.B., M.A.J., R.R., and J.S. designed research; N.G.R., L.C.B., S.J.F., S.K., C.J.M., E.M., D.O., E.L.R., A.T., and T.K.Y. performed research; N.G.R., C.J.M., and E.M. analyzed data; and N.G.R. wrote the paper.

Accurate prediction of the size and timing of infectious disease outbreaks could help public health officials in planning an appropriate response. This paper compares approaches developed by five different research groups to forecast seasonal influenza outbreaks in real time in the United States. Many of the models show more accurate forecasts than a historical baseline. A major impediment to predictive ability was the real-time accuracy of available data. The field of infectious disease forecasting is in its infancy and we expect that innovation will spur improvements in forecasting in the coming years.

Influenza infects an estimated 9–35 million individuals each year in the United States and is a contributing cause for between 12,000 and 56,000 deaths annually. Seasonal outbreaks of influenza are common in temperate regions of the world, with highest incidence typically occurring in colder and drier months of the year. Real-time forecasts of influenza transmission can inform public health response to outbreaks. We present the results of a multiinstitution collaborative effort to standardize the collection and evaluation of forecasting models for influenza in the United States for the 2010/2011 through 2016/2017 influenza seasons. For these seven seasons, we assembled weekly real-time forecasts of seven targets of public health interest from 22 different models. We compared forecast accuracy of each model relative to a historical baseline seasonal average. Across all regions of the United States, over half of the models showed consistently better performance than the historical baseline when forecasting incidence of influenza-like illness 1 wk, 2 wk, and 3 wk ahead of available data and when forecasting the timing and magnitude of the seasonal peak. In some regions, delays in data reporting were strongly and negatively associated with forecast accuracy. More timely reporting and an improved overall accessibility to novel and traditional data sources are needed to improve forecasting accuracy and its integration with real-time public health decision making.

Over the past 15 y, the number of published research articles on forecasting infectious diseases has tripled (Web of Science).

Forecasts of infectious disease transmission can inform public health response to outbreaks. Accurate forecasts of the timing and spatial spread of infectious disease incidence can provide valuable information about where public health interventions can be targeted.

While multimodel comparisons exist in the literature for single-outbreak performance, standardized comparisons of real-time forecasts across multiple seasons and modeling teams have been rare.

Influenza is a respiratory viral infection that can cause mild or severe symptoms. In the United States each year, influenza viruses infect an estimated 9–35 million individuals and cause between 12,000 and 56,000 deaths.

Starting in the 2013/2014 influenza season, the US Centers for Disease Control and Prevention (CDC) has run the “Forecast the Influenza Season Collaborative Challenge” (a.k.a. FluSight) each influenza season, soliciting prospective, real-time weekly forecasts of regional-level weighted influenza-like illness (wILI) measures from teams across the world.


Building on the structure of the FluSight challenges [and those of other collaborative forecasting efforts], we assembled prospective, out-of-sample forecasts from 22 models for the 2010/2011 through 2016/2017 influenza seasons.

List of models, with key characteristics

| Team | Model abbreviation | Model description | Ext. data | Mech. model | Ens. model |
| --- | --- | --- | --- | --- | --- |
| CU | EAKFC_SEIRS | Ensemble adjustment Kalman filter SEIRS | x | x | |
| CU | EAKFC_SIRS | Ensemble adjustment Kalman filter SIRS | x | x | |
| CU | EKF_SEIRS | Ensemble Kalman filter SEIRS | x | x | |
| CU | EKF_SIRS | Ensemble Kalman filter SIRS | x | x | |
| CU | RHF_SEIRS | Rank histogram filter SEIRS | x | x | |
| CU | RHF_SIRS | Rank histogram filter SIRS | x | x | |
| CU | BMA | Bayesian model averaging | | | |
| Delphi | BasisRegression | Basis regression, epiforecast defaults | | | |
| Delphi | DeltaDensity1 | Delta density, epiforecast defaults | | | |
| Delphi | EmpiricalBayes1 | Empirical Bayes, conditioning on past 4 wk | | | |
| Delphi | EmpiricalBayes2 | Empirical Bayes, epiforecast defaults | | | |
| Delphi | EmpiricalFuture | Empirical futures, epiforecast defaults | | | |
| Delphi | EmpiricalTraj | Empirical trajectories, epiforecast defaults | | | |
| Delphi | DeltaDensity2 | Markovian delta density, epiforecast defaults | | | |
| Delphi | Uniform | Uniform distribution | | | |
| Delphi | Stat | Ensemble, combination of 8 Delphi models | | | x |
| LANL | DBM | Dynamic Bayesian SIR model with discrepancy | | x | |
| ReichLab | KCDE | Kernel conditional density estimation | | | |
| ReichLab | KDE | Kernel density estimation and penalized splines | | | |
| ReichLab | SARIMA1 | SARIMA model without seasonal differencing | | | |
| ReichLab | SARIMA2 | SARIMA model with seasonal differencing | | | |
| UTAustin | EDM | Empirical dynamic model or method of analogues | | | |

Team abbreviations: CU, Columbia University; Delphi, Carnegie Mellon; LANL, Los Alamos National Laboratory; ReichLab, University of Massachusetts-Amherst; SEIRS, Susceptible-Exposed-Infectious-Recovered-Susceptible, and SIRS, Susceptible-Infectious-Recovered-Susceptible, compartmental models of infectious disease transmission; UTAustin, University of Texas at Austin. The “Ext. data” column notes models that use data external to the ILINet data from CDC. The “Mech. model” column notes models that rely to some extent on a mechanistic or compartmental model formulation. The “Ens. model” column notes models that are ensemble models.

*Note that some of these components were not designed as standalone models, so their performance may not reflect the full potential of the method’s accuracy.

In addition to analyzing comparative model performance over multiple seasons, this work identifies key bottlenecks that limit the accuracy and generalizability of current forecasting efforts. Specifically, we present quantitative analyses of the impact that incomplete or partial case reporting has on forecast accuracy. Additionally, we assess whether purely statistical models show similar performance to that of models that consider explicit mechanistic models of disease transmission. Overall, this work shows strong evidence that carefully crafted forecasting models for region-level influenza in the United States consistently outperformed a historical baseline model for targets of particular public health interest.

Influenza forecasts have been evaluated by the CDC primarily using a variation of the log score, a measure that evaluates both the precision and accuracy of a forecast.

Average scores for all of the short-term forecasts (1- through 4-wk-ahead targets) varied substantially across models and regions.

Average forecast score by model, region, and target type, averaged over weeks and seasons. The text within the grid shows the score itself. The white midpoint of the color scale is set to be the target- and region-specific average of the historical baseline model, ReichLab-KDE, with darker blue colors representing models that have better scores than the baseline and darker red colors representing models that have worse scores than the baseline. The models are sorted in descending order from most accurate (top) to least accurate (bottom), and regions are sorted from high scores (right) to low scores (left).

Models were more consistently able to forecast week-ahead wILI in some regions than in others. Predictability for a target can be broken down into two components: first, what baseline score can a model derived solely from historical averages achieve? Second, how much more accuracy can alternate modeling approaches add beyond this historical baseline? Looking at results across all models, HHS region 1 was the most predictable and HHS region 6 was the least predictable.

The models presented show substantial improvements in accuracy compared with forecasts from the historical baseline model in all regions of the United States. Results that follow are based on summaries from those models that on average showed higher forecast score than the historical baseline model. HHS region 1 showed the best overall week-ahead predictability of any region. Here, the models showed an average forecast score of 0.54 for week-ahead targets.

Absolute and relative forecast performance for week-ahead and seasonal targets.

Forecast score declined as the target moved farther into the future relative to the most recent observation. In making 1-wk-ahead forecasts, 15 of 22 models outperformed the historical baseline in at least six of the seven seasons; for 4-wk-ahead forecasts, only 7 of 22 models did so. For the model with the highest forecast score across all 4-wk-ahead targets (CU-EKF_SIRS), the average scores across regions and seasons for 1- through 4-wk-ahead forecasts were 0.55, 0.44, 0.36, and 0.31, mirroring an overall decline in score observed across most models. Only in HHS region 1 were the forecast scores from the CU-EKF_SIRS model above 0.5 for both of the “nowcast” targets (1 and 2 wk ahead).

Overall, forecast score was lower for seasonal targets than for week-ahead targets, although the models showed greater relative improvement compared with the baseline model.

Of the three seasonal targets, models showed the lowest average score in forecasting season onset, with an overall average score of 0.15. Due to the variable timing of season onset, different numbers of weeks were included in the final scoring for each region–season.

Accuracy in forecasting season onset was also impacted by revisions to wILI data. In some region–seasons, current data led models to be highly confident that onset had occurred in one week, only to have revised data later in the season change the week that was considered to be the onset. One good example of this is HHS region 2 in 2015/2016. Here, data in early 2016 showed season onset to be epidemic week 2 (EW2) of 2016. Revisions to the data around EW12 led the models to identify EW51 as the onset. A further revision, occurring in EW21 of 2016, showed the onset actually occurred in EW4 of 2016. Networked metapopulation models that take advantage of observed activity in one location to inform forecasts of other locations have shown promise for improving forecasts of season onset.

Models showed an overall average score of 0.23 in forecasting peak week. The best model for peak week was ReichLab-KCDE.

Models showed an overall average score of 0.20 in forecasting peak intensity. The best model for peak intensity was LANL-DBM, with overall average score of 0.38. Region- and season-specific forecast scores from this model for peak intensity ranged from 0.13 to 0.61. The historical baseline model showed a score of 0.13 in forecasting peak intensity. Overall, 12 of 22 models (55%) had a better overall score than the historical baseline in at least six of the seven seasons evaluated.

While models for peak week and peak percentage converged on the observed values after the peak occurred, all models showed substantial uncertainty before the peak.

Average forecast score by model and week relative to peak. Scores for each location–season were aligned to summarize average performance relative to the peak week on the x axis.

Averaging across all targets and locations, forecast scores varied widely by model and season.

Average forecast score, aggregated across targets, regions, and weeks, plotted separately for each model and season. Models are sorted from lowest scores (left) to highest scores (right). Higher scores indicate better performance. Circles show average scores across all targets, regions, and weeks within a given season. The “x” marks the geometric mean of the seven seasons. The names of compartmental models are shown in boldface type. The ReichLab-KDE model (red italics) is considered the historical baseline model.

The six top-performing models used a range of methodologies, highlighting that very different approaches can result in very similar overall performance. The overall best model was an ensemble model (Delphi-Stat) that used a weighted combination of other models from the Delphi group. Both the ReichLab-KCDE and the Delphi-DeltaDensity1 models used variants of kernel conditional density estimation.

On the whole, statistical models achieved scores similar to or slightly higher than those of compartmental models when forecasting both week-ahead and seasonal targets, although the differences were small and of minimal practical significance. Using the best three overall models from each category, we computed the average forecast score for each combination of region, season, and target.

Comparison of the top three statistical models (Delphi-DeltaDensity1, ReichLab-KCDE, ReichLab-SARIMA2) and the top three compartmental models (LANL-DBM, CU-EKF_SIRS, CU-RHF_SIRS).

| Target | Statistical model score | Compartmental model score | Difference |
| --- | --- | --- | --- |
| 1 wk ahead | 0.49 | 0.43 | 0.06 |
| 2 wk ahead | 0.40 | 0.41 | −0.01 |
| 3 wk ahead | 0.35 | 0.34 | 0.00 |
| 4 wk ahead | 0.32 | 0.30 | 0.02 |
| Season onset | 0.23 | 0.22 | 0.01 |
| Season peak percentage | 0.32 | 0.27 | 0.05 |
| Season peak week | 0.34 | 0.32 | 0.02 |

The difference column represents the difference in the average probability assigned to the eventual outcome for the target in each row. Positive values indicate the top statistical models showed higher average score than the top compartmental models.

In the seven seasons examined in this study, wILI percentages were often revised after first being reported. The frequency and magnitude of revisions varied by region, although the majority of initial values (nearly 90%) were within ±0.5 percentage points of the final observed value. For example, in HHS region 9, over 51% of initially reported wILI values ended up being revised by more than 0.5 percentage points, while in HHS region 5 less than 1% of values were revised that much. Across all regions, 10% of observations were ultimately revised by more than 0.5 percentage points.
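The summary just described can be computed directly from paired first and final reports. A minimal sketch, using hypothetical values rather than actual ILINet data:

```python
def fraction_heavily_revised(first_reports, final_values, threshold=0.5):
    """Fraction of region-weeks whose initially reported wILI value was
    later revised by more than `threshold` percentage points."""
    pairs = list(zip(first_reports, final_values))
    revised = sum(1 for first, final in pairs if abs(first - final) > threshold)
    return revised / len(pairs)

# Hypothetical example: 1 of 5 region-weeks revised by > 0.5 points
print(fraction_heavily_revised([2.1, 3.4, 1.0, 5.2, 2.8],
                               [2.2, 3.3, 1.9, 5.1, 2.8]))  # → 0.2
```

The same function applied per region would reproduce the kind of region-level contrast reported above.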

When the first report of the wILI measurement for a given region–week was revised in subsequent weeks, we observed a corresponding strong negative impact on forecast accuracy. Larger revisions to the initially reported data were strongly associated with a decrease in the forecast score for the forecasts made using the initial, unrevised data. Specifically, among the four top-performing nonensemble models (ReichLab-KCDE, LANL-DBM, Delphi-DeltaDensity1, and CU-EKF_SIRS), there was an average change in forecast score of −0.29 (95% CI: −0.39, −0.19) when the first observed wILI measurement was between 2.5 and 3.5 percentage points lower than the final observed value, adjusting for model, week of year, and target.

Model-estimated changes in forecast skill due to bias in initial reports of wILI %. Shown are estimated coefficient values (and 95% confidence intervals) from a multivariable linear regression using model, week of year, target, and a categorized version of the bias in the first reported wILI % to predict forecast score.

This work presents a large-scale comparison of real-time forecasting models from different modeling teams across multiple years. With the rapid increase in infectious disease forecasting efforts, it can be difficult to understand the relative importance of different methodological advances in the absence of an agreed-upon set of standard evaluations. We have built on the foundational work of CDC efforts to establish and evaluate models against a set of shared benchmarks which other models can use for comparison. Our collaborative, team science approach highlights the ability of multiple research groups working together to uncover patterns and trends of model performance that are harder to observe in single-team studies.

Seasonal influenza in the United States, given the relative accessibility of historical surveillance data and recent history of coordinated forecasting “challenges,” is an important testbed system for understanding the current state of the art of infectious disease forecasting models. Using models from some of the most experienced forecasting teams in the country, this work reveals several key results about forecasting seasonal influenza in the United States: a majority of models consistently showed higher accuracy than historical baseline forecasts, both in regions with more predictable seasonal trends and in those with less consistent seasonal patterns.

As knowledge and data about a given infectious disease system improve and become more granular, a common question among domain-area experts is whether mechanistic models will outperform more statistical approaches. However, the statistical vs. mechanistic model dichotomy is not always a clean distinction in practice. In the case of influenza, mechanistic models simulate a specific disease transmission process governed by the assumed parameters and structure of the model. But observed “influenza-like illness” data are driven by many factors that have little to do with influenza transmission (e.g., clinical visitation behaviors, the symptomatic diagnosis process, the case-reporting process, a data-revision process, etc.). Since ILI data represent an impure measure of actual influenza transmission, purely mechanistic models may be at a disadvantage in comparison with more structurally flexible statistical approaches when attempting to model and forecast ILI. To counteract this potential limitation of mechanistic models in modeling noisy surveillance data, many forecasting models that have a mechanistic core also use statistical approaches that explicitly or implicitly account for unexplained discrepancies from the underlying model.

There are several important limitations to this work as presented. While we have assembled and analyzed a range of models from experienced influenza-forecasting teams, there are large gaps in the types of data and models represented in our library of models. For example, relatively few additional data sources have been incorporated into these models, no models are included that explicitly incorporate information about circulating strains of influenza, and no model explicitly includes spatial relationships between regions. Given that several of the models rely on similar modeling frameworks, adding a more diverse set of modeling approaches would be a valuable contribution. Additionally, while seven seasons of forecasts from 22 models is the largest study we know of that compares models from multiple teams, this remains a less-than-ideal sample size to draw strong conclusions about model performance. Since each season represents a set of highly correlated dynamics across regions, few data are available from which to draw strong conclusions about comparative model performance. Finally, these results should not be used to extrapolate hypothetical accuracy in pandemic settings, as these models were optimized specifically to forecast seasonal influenza.

What is the future of influenza forecasting in the United States and globally? Long-run forecast accuracy for influenza will vary based on a variety of factors, including, e.g., data quality, the geographical scale of forecasts, population density of forecasted areas, and consistency of weather patterns over time.

To advance infectious disease forecasting broadly, a complete enumeration and understanding of the challenges facing the field are critical. In this work, we have identified and quantified some of these challenges, specifically focusing on timely reporting of surveillance data. However, other barriers may be of equal or greater importance to continued improvement of forecasts. Often, researchers either lack access to or do not know how best to make use of novel data streams (e.g., Internet data, electronic medical health record data). Increased methodological innovation in models that merge together an understanding of biological drivers of disease transmission (e.g., strain-specific dynamics and vaccination effectiveness) with statistical approaches to combine data hierarchically at different spatial and temporal scales will be critical to moving this field forward. From a technological perspective, additional efforts to standardize data collection, format, storage, and access will increase interoperability between groups with different modeling expertise, improve accessibility of novel data streams, and continue to provide critical benchmarks and standards for the field. Continuing to refine forecasting targets to more closely align with public health activities will improve integration of forecasts with decision making. Recent work from the CDC has developed standardized algorithms to classify the severity of influenza seasons.

Public health officials are still learning how to best integrate forecasts into real-time decision making. Close collaboration between public health policymakers and quantitative modelers is necessary to ensure that forecasts have maximum impact and are appropriately communicated to the public and the broader public health community. Real-time implementation and testing of forecasting methods play a central role in planning and assessing what targets should be forecasted for maximum public health impact.

Detailed methodology and results from previous FluSight challenges have been published.

During each influenza season, the wILI data are updated each week by the CDC. When the most recent data are released, the prior weeks’ reported wILI data may also be revised. The unrevised data available at a particular moment in time can be retrieved via the DELPHI real-time epidemiological data API, beginning with the 2014/2015 season.

The FluSight challenges have defined seven forecasting targets of particular public health relevance. Three of these targets are fixed scalar values for a particular season: onset week, peak week, and peak intensity (i.e., the maximum observed wILI percentage). The remaining four targets are the observed wILI percentages in each of the subsequent 4 wk.

The FluSight challenges have also required that all forecast submissions follow a particular format. A single submission file (a comma-separated text file) contains the forecast made for a particular EW of a season. Standard CDC definitions of EW are used.
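One basic property of such a file is that the binned probabilities for each location–target pair should sum to 1. The sketch below checks this, assuming column names that follow the public FluSight submission template (“Location”, “Target”, “Type”, “Value”); a file with different headers would need the names adjusted:

```python
import csv
from collections import defaultdict

def check_bin_probabilities(path, tol=1e-6):
    """For each (Location, Target) pair in a FluSight-style submission CSV,
    report whether the probabilities in its "Bin" rows sum to 1 within tol."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["Type"].strip().lower() == "bin":
                totals[(row["Location"], row["Target"])] += float(row["Value"])
    return {key: abs(total - 1.0) <= tol for key, total in totals.items()}
```

A check of this kind can be run on every weekly file before submission to catch malformed forecasts early.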

To be included in the model comparison presented here, previous participants in the CDC FluSight challenge were invited to provide out-of-sample forecasts for the 2010/2011 through 2016/2017 seasons. For each season, files were submitted for EW40 of the first calendar year of the season through EW20 of the following calendar year. (For seasons that contained an EW53, an additional file labeled EW53 was included.) For each model, this involved creating 233 separate forecast submission files, one for each of the weeks in the seven training seasons. In total, the forecasts represent over 40 million rows and 2.5 GB of data. Each forecast file represented a single submission file, as would be submitted to the CDC challenge. Each team created its submitted forecasts in a prospective, out-of-sample fashion, i.e., fitting or training the model only on data available before the time of the forecast.

Five teams each submitted between one and nine separate models for evaluation.

Three models stand out as being reference models. One shared feature of these models is that their forecasts do not depend on observed data from the season being forecasted. The Delphi-Uniform model always provides a forecast that assigns equal probability to all possible outcomes. The ReichLab-KDE model yields predictive distributions based entirely on data from other seasons using kernel density estimation (KDE) for seasonal targets and a generalized additive model with cyclic penalized splines for weekly incidence. The Delphi-EmpiricalTraj model uses KDE for all targets. The “historical baseline” model named throughout this paper refers to the ReichLab-KDE model. Because this model represents a prediction that essentially summarizes historical data, we consider this model an appropriate baseline model to reflect historical trends.
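A minimal sketch of the historical-baseline idea: a Gaussian kernel density estimate built only from past seasons’ values of a seasonal target, converted into probabilities for 0.5-unit wILI bins. The past-season peak values and the bandwidth below are hypothetical, and this illustrates the general approach rather than the ReichLab-KDE implementation:

```python
import math

def kde_bin_probability(data, lo, hi, bandwidth=0.75, n_grid=200):
    """Probability mass that a Gaussian KDE over `data` assigns to the
    bin [lo, hi), approximated with a midpoint Riemann sum."""
    step = (hi - lo) / n_grid
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    mass = 0.0
    for i in range(n_grid):
        x = lo + (i + 0.5) * step
        density = sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm
        mass += density * step
    return mass

# Hypothetical peak wILI values from past seasons
past_peaks = [4.1, 6.2, 3.8, 7.1, 4.5, 5.9]
edges = [0.5 * k for k in range(27)]  # 0.5-unit bins from 0 to 13 wILI %
probs = [kde_bin_probability(past_peaks, a, b) for a, b in zip(edges, edges[1:])]
total = sum(probs)
probs = [p / total for p in probs]  # renormalize mass falling outside [0, 13]
```

Because the estimate uses no data from the season being forecasted, the resulting predictive distribution stays fixed all season, which is exactly what makes it a useful baseline.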

We note that some of the models presented here were developed as standalone forecasting models whereas others were developed as components of a larger ensemble system. We define a standalone model as one that is rigorously validated to show optimal performance on its own. Component models could also be optimized, although they could also be developed solely to provide a specific or supplemental signal as part of a larger system. All of the Delphi group’s models except for Delphi-Stat were developed as components rather than standalone models. Despite this, some of the Delphi models, in particular, Delphi-DeltaDensity1, performed quite well relative to other standalone models. Component models can also provide useful baselines for comparison, e.g., the Delphi-Uniform model, which assigns uniform probability to all possible outcomes, and the Delphi-EmpiricalTraj model, which creates a seasonal average model that is not updated based on current data.

Once submitted to the central repository, the models were not updated or modified except in four cases to fix explicit bugs in the code that yielded numerical problems with the forecasts. (In all cases, the updates did not substantially change the performance of the updated models.) Refitting of models or tuning of model parameters was explicitly discouraged to avoid unintentional overfitting of models.

The log score for a model is the natural logarithm of the probability the model assigned to the eventually observed outcome: given a predictive distribution f_m(z | x) for model m and observed outcome z*, the log score is log f_m(z* | x).

Following CDC FluSight evaluation procedures, we computed modified log scores for the targets on the wILI percentage scale such that predictions within ±0.5 percentage points of the eventually observed value were considered accurate.
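A sketch of this modified log score, assuming a binned predictive distribution over wILI % and treating all probability mass within the window around the observed value as credited to the forecast:

```python
import math

def modified_log_score(bins, observed, window=0.5):
    """bins: list of (lower_edge, upper_edge, probability) describing a
    binned predictive distribution of wILI %. Sums the probability of every
    bin overlapping [observed - window, observed + window], then takes the log."""
    total = sum(p for lo, hi, p in bins
                if hi > observed - window and lo < observed + window)
    return math.log(total) if total > 0 else float("-inf")

# Toy forecast over three 0.5-unit bins; observed value 1.2 credits the
# first two bins (0.2 + 0.5), so the score is log(0.7)
forecast = [(1.0, 1.5, 0.2), (1.5, 2.0, 0.5), (2.0, 2.5, 0.3)]
score = modified_log_score(forecast, observed=1.2)
```

Widening the window makes the score more forgiving; a window of zero recovers the ordinary log score of the single observed bin.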

Average log scores can be used to compare models’ performance in forecasting for different locations, seasons, targets, or times of season. In practice, each model’s log scores were averaged across the relevant combinations of region, season, target, and week to produce the summary measures reported here.

While log scores are not on a particularly interpretable scale, a simple transformation enhances interpretability substantially. Exponentiating an average log score yields a forecast score equivalent to the geometric mean of the probabilities assigned to the eventually observed outcome (or, more specifically for the modified log score, to regions of the distribution eventually considered accurate). The geometric mean is an alternative measure of central tendency to the arithmetic mean, representing the nth root of the product of n values.

Following the convention of the CDC challenges, we included only certain weeks in the calculation of the average log scores for each target. This focuses model evaluation on periods of time that are more relevant for public health decision making. Forecasts of season onset are evaluated based on the forecasts that are received up to 6 wk after the observed onset week within a given region. Peak week and peak intensity forecasts were scored for all weeks in a specific region–season up until the wILI measure drops below the regional baseline level for the final time. Week-ahead forecasts are evaluated using forecasts received 4 wk before the onset week through forecasts received 3 wk after the wILI goes below the regional baseline for the final time. In a region–season without an onset, all weeks are scored. To ensure all calculated summary measures would be finite, all log scores with values of less than −10 were assigned the value −10, following CDC scoring conventions. This rule was invoked for 2,648 scores or 0.8% of all scores that fell within the scoring period. All scores were based on “ground truth” values of wILI data obtained as of September 27, 2017.
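The aggregation described above, flooring each log score at −10, averaging, and exponentiating, can be written compactly; the floor value follows the CDC scoring convention noted in the text:

```python
import math

def forecast_score(log_scores, floor=-10.0):
    """Exponentiated average of floored log scores: the geometric mean of
    the probabilities assigned to the eventually observed outcomes."""
    clipped = [max(s, floor) for s in log_scores]
    return math.exp(sum(clipped) / len(clipped))

# Probabilities 0.5 and 0.2 give a geometric mean of sqrt(0.1), about 0.316
score = forecast_score([math.log(0.5), math.log(0.2)])
```

The floor guarantees every average is finite: a single forecast that assigned essentially zero probability to the outcome contributes exp(−10) rather than dragging the geometric mean to zero.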

The CDC publicly releases data on doctor’s office visits due to ILI each week. These data, especially for the most recent weeks, are occasionally revised, due to new or updated data being reported to the CDC since their last publication. While often these revisions are fairly minor or nonexistent, at other times these revisions can be substantial, changing the reported wILI value by over 50% of the originally reported value. Since the unrevised data are used by forecasters to generate current forecasts, real-time forecasts can be biased by the initially reported, preliminary data.

We used a regression model to analyze the impact of these unrevised reports on forecasting. Specifically, for each region and EW we calculated the difference between the first and the last reported wILI values for each EW for which forecasts were generated in the seven seasons under consideration. We then created a categorical variable from these differences and used it, together with model, week of year, and target, as a predictor of forecast score in a multivariable linear regression.
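With a single categorical predictor and an intercept, the OLS coefficients reduce to differences in group means relative to the reference category. The sketch below illustrates that reduced version of the analysis; the full regression also adjusts for model, week of year, and target, and the category labels and scores here are hypothetical:

```python
from collections import defaultdict

def bias_category_effects(records, reference="within 0.5"):
    """records: iterable of (bias_category, forecast_score) pairs.
    Returns each category's mean score minus the reference category's mean,
    i.e., the OLS coefficients for a dummy-coded single-predictor regression."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, score in records:
        sums[category] += score
        counts[category] += 1
    ref_mean = sums[reference] / counts[reference]
    return {c: sums[c] / counts[c] - ref_mean for c in sums if c != reference}

records = [("within 0.5", 0.40), ("within 0.5", 0.50),
           ("under by 2.5-3.5", 0.20), ("under by 2.5-3.5", 0.10)]
effects = bias_category_effects(records)  # ≈ {"under by 2.5-3.5": -0.30}
```

A negative coefficient for a heavily biased category corresponds directly to the drop in forecast score reported for forecasts made from badly unrevised data.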

There is no consensus on a single best modeling approach or method for forecasting the dynamic patterns of infectious disease outbreaks in both endemic and emergent settings. Semantically, modelers and forecasters often use a dichotomy of mechanistic vs. statistical (or “phenomenological”) models to represent two different philosophical approaches to modeling. Mechanistic models for infectious disease consider the biological underpinnings of disease transmission and in practice are implemented as variations on the susceptible–infectious–recovered (SIR) model. Statistical models largely ignore the biological underpinnings and theory of disease transmission and focus instead on using data-driven, empirical, and statistical approaches to make the best forecasts possible of a given dataset or phenomenon.

However, in practice, this dichotomy is less clear than it is in theory. For example, statistical models for infectious disease counts may have an autoregressive term for incidence (e.g., as done by the ReichLab-SARIMA1 model). This could be interpreted as representing a transmission process from one time period to another. In another example, the LANL-DBM model has an explicit SIR compartmental model component but also uses a purely statistical model for the discrepancy of the compartmental model with observed trends. The models from Columbia University used a statistical nowcasting approach for their 1-wk-ahead forecasts, but after that relied on different variations of a SIR model.

We categorized models according to whether or not they had any explicit compartmental framework.

To maximize the reproducibility and data availability for this project, the data and code for the entire project (excluding specific model code) are publicly available. The project is available on GitHub.

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention, Defense Advanced Research Projects Agency, Defense Threat Reduction Agency, the National Institutes of Health, National Science Foundation, or Uptake Technologies.

Conflict of interest statement: J.S. and Columbia University disclose partial ownership of SK Analytics.

This article is a PNAS Direct Submission. S.F. is a guest editor invited by the Editorial Board.

Data deposition: The data and code for this analysis have been deposited in GitHub,

See Commentary on page