With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatiotemporal process models have become widely deployed statistical tools for researchers to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatiotemporal models often involves expensive matrix computations with complexity increasing as the cube of the number of spatial locations and temporal points. This renders such models infeasible for large data sets. This article offers a focused review of two methods for constructing well-defined highly scalable spatiotemporal stochastic processes. Both these processes can be used as “priors” for spatiotemporal random fields. The first approach constructs a low-rank process operating on a lower-dimensional subspace. The second approach constructs a Nearest-Neighbor Gaussian Process (NNGP) that ensures sparse precision matrices for its finite realizations. Both processes can be exploited as a scalable prior embedded within a rich hierarchical modeling framework to deliver full Bayesian inference. These approaches can be described as model-based solutions for big spatiotemporal datasets. The models ensure that the algorithmic complexity grows approximately linearly in the number of spatial locations and time points.

The increased availability of inexpensive, high speed computing has enabled the collection of massive amounts of spatial and spatiotemporal datasets across many fields. This has resulted in widespread deployment of sophisticated Geographic Information Systems (GIS) and related software, and the ability to investigate challenging inferential questions related to geographically-referenced data. See, for example, the books by

This article will focus only on point-referenced data, which refers to data referenced by points with coordinates (latitude-longitude, Easting-Northing etc.). Modeling typically proceeds from a spatial or spatiotemporal process that introduces dependence among any finite collection of random variables from an underlying random field. For our purposes, we will consider the stochastic process as an uncountable set of random variables, say {w(ℓ) : ℓ ∈ ℒ}, where ℒ is a subset of ℜ^{d} (for purely spatial settings) or of ℜ^{d} × [0, ∞) (for spatiotemporal settings).

Such processes are specified with a covariance function C_{θ}(ℓ, ℓ′), where θ denotes the parameters indexing a family of such functions. For any finite set 𝒰 = {ℓ_{1}, ℓ_{2}, …, ℓ_{n}}, let w_{𝒰} = (w(ℓ_{1}), w(ℓ_{2}), …, w(ℓ_{n}))^{⊤} be the realizations of the process over 𝒰. Also, for two finite sets 𝒰 and 𝒱, let C_{θ}(𝒰, 𝒱) denote the matrix of covariances cov{w_{𝒰}, w_{𝒱}}. A valid covariance function ensures that, for any finite 𝒰, the vector w = (w(ℓ_{1}), w(ℓ_{2}), …, w(ℓ_{n}))^{⊤} is distributed as N(0, C_{θ}(𝒰, 𝒰)), where C_{θ}(𝒰, 𝒰) is the n×n positive definite matrix with (i, j)-th element C_{θ}(ℓ_{i}, ℓ_{j}).
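As a minimal numerical sketch of the above, the covariance function evaluated over any finite set of locations yields a symmetric positive-definite matrix, from which a finite realization of the process can be drawn via a Cholesky factor. The exponential covariance, the parameter values and the random seed below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def exp_cov(U, V, sigma2=1.0, phi=0.5):
    """Exponential covariance C(l, l') = sigma2 * exp(-phi * ||l - l'||)."""
    d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

# A finite set U of n locations in the plane
n = 50
U = rng.uniform(0, 10, size=(n, 2))
C = exp_cov(U, U)  # C_theta(U, U), symmetric positive definite

# A finite realization w ~ N(0, C) drawn via the Cholesky factor of C
L = np.linalg.cholesky(C)
w = L @ rng.standard_normal(n)
```

Any finite subset of locations would work the same way; positive definiteness of the resulting matrix is exactly what "valid covariance function" guarantees.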

Spatial and spatiotemporal processes are conveniently embedded within Bayesian hierarchical models. The most common geostatistical setting assumes a response or dependent variable y(ℓ) observed at a generic location ℓ, along with a p×1 vector of spatially referenced predictors x(ℓ).

y(ℓ) = x^{⊤}(ℓ)β + w(ℓ) + ε(ℓ), where w(ℓ) ~ GP(0, C_{θ}(·, ·)) is a zero-centered Gaussian process capturing spatial (and temporal) dependence and ε(ℓ) is a white-noise measurement error process.

Collecting the model over the n observed locations, y = (y(ℓ_{1}), y(ℓ_{2}), …, y(ℓ_{n}))^{⊤} is the n×1 vector of outcomes, each x^{⊤}(ℓ_{i}) forms a row of the n×p design matrix, and ε ~ N(0, τ^{2}I_{n}); τ^{2} is called the “nugget.” The hierarchy is completed by assigning prior distributions to β, θ and τ^{2}.

Bayesian inference can proceed by sampling from the joint posterior density of {β, θ, τ^{2}, w}. Each evaluation of the likelihood requires factorizing the dense n×n covariance matrix C_{θ}(𝒰, 𝒰) + τ^{2}I_{n}, which involves ~n^{3} floating point operations (flops). Memory requirements are of the order ~n^{2}. These become prohibitive for large values of n.
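The bottleneck can be made concrete with a short sketch of one marginal Gaussian log-likelihood evaluation: the Cholesky factorization of the dense n×n matrix is the ~n^{3} step that every posterior sampling iteration must repeat. The covariance choice, nugget value and seed are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 200
U = rng.uniform(0, 10, size=(n, 2))
d = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
C = np.exp(-0.5 * d)          # spatial covariance C_theta(U, U)
tau2 = 0.1                    # nugget variance
Sigma = C + tau2 * np.eye(n)  # marginal covariance of y
y = np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

# Each likelihood evaluation factorizes the dense n x n matrix Sigma:
# ~n^3/3 flops and ~n^2 storage -- the bottleneck for large n.
cf = cho_factor(Sigma, lower=True)
quad = y @ cho_solve(cf, y)                      # y^T Sigma^{-1} y
logdet = 2.0 * np.sum(np.log(np.diag(cf[0])))    # log det(Sigma)
loglik = -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```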

As modern data technologies are acquiring and exploiting massive amounts of spatiotemporal data, modeling and inference for large spatiotemporal datasets are receiving increased attention. In fact, it is impossible to provide a comprehensive review of all existing methods for geostatistical models for massive spatial data sets; instead, we note that most approaches endow the covariance matrix, or its inverse Σ^{−1}, with an exploitable structure so that linear systems involving Σ^{−1} can be solved without the full cubic cost.

We remark that when inferring about stochastic processes, it is also possible to work in the spectral domain. This rich, and theoretically attractive, option has been advocated by

Broadly speaking, model-based approaches for large spatial datasets proceed from either exploiting “low-rank” models or exploiting “sparsity”. The former attempts to construct Gaussian processes on a lower-dimensional subspace (see, e.g.,

This article aims to provide a focused review of some massively scalable Bayesian hierarchical models for spatiotemporal data. The aim is not to provide a comprehensive review of all existing methods. Instead, we focus upon two fully model-based approaches that can be easily embedded within hierarchical models and deliver full Bayesian inference. These are low-rank processes and sparsity-inducing processes. Both these processes can be used as “priors” for spatiotemporal random fields. Here is a brief outline of the paper. Section 2 discusses a Bayesian hierarchical framework for low-rank models and their implementation. Section 3 discusses some recent developments in sparsity-inducing Gaussian processes, especially nearest-neighbor Gaussian processes, and their implementation. Finally, Section 4 provides a brief account of outstanding issues for future research.

A popular way of dealing with large spatial datasets is to devise models that bring about dimension reduction (

A low-rank specification writes w(ℓ) ≈ Σ_{j=1}^{r} z_{j}B_{j}(ℓ) = B_{θ}^{⊤}(ℓ)z, where {B_{j}(ℓ)} is a collection of r basis functions and z = (z_{1}, z_{2}, …, z_{r})^{⊤}. The realization w = (w(ℓ_{1}), w(ℓ_{2}), …, w(ℓ_{n}))^{⊤} is represented as B_{θ}z, where B_{θ} is the n×r matrix whose (i, j)-th entry is B_{j}(ℓ_{i}).

Here z ~ N(0, V_{z}), so that var{w} = B_{θ}V_{z}B_{θ}^{⊤} has rank at most r.

Different families of spatial models emerge from different specifications for the basis functions B_{j}(ℓ) and for V_{z} = var{z}. Marginalizing over z yields var{y} = B_{θ}V_{z}B_{θ}^{⊤} + τ^{2}I_{n}, a low-rank plus diagonal structure that is the common computational engine behind these models; the choice of B_{θ} distinguishes one family from another.

Some choices of basis functions can be more computationally efficient than others depending upon the specific application. For example, compactly supported basis functions render B_{θ} sparse, so that matrix-vector products involving B_{θ} become inexpensive.

A different approach is to specify the

The idea underlying low-rank dimension reduction is not dissimilar to Bayesian linear regression. For example, consider a simplified version of the hierarchical model: y = B_{θ}z + ε, where z ~ N(0, V_{z}) and ε ~ N(0, τ^{2}I_{n}).

Marginalizing over z yields var{y | θ, τ} = Σ_{y} = B_{θ}V_{z}B_{θ}^{⊤} + τ^{2}I_{n}.

The above formula reveals dimension reduction in terms of the marginal covariance matrix for y: an n×n matrix built from a rank-r component added to the diagonal matrix τ^{2}I_{n}.

By the Sherman–Woodbury–Morrison formula, Σ_{y}^{−1} = τ^{−2}I_{n} − τ^{−4}B_{θ}(V_{z}^{−1} + τ^{−2}B_{θ}^{⊤}B_{θ})^{−1}B_{θ}^{⊤}, which requires inverting only r×r matrices. The determinant follows analogously from the matrix determinant lemma: det(Σ_{y}) = det(V_{z}^{−1} + τ^{−2}B_{θ}^{⊤}B_{θ}) det(V_{z}) (τ^{2})^{n}.

In practical Bayesian computations, however, it is often less efficient to use these formulas directly than to work with the unmarginalized hierarchical model.

One samples z from its full conditional distribution z | y, θ, τ ~ N(ẑ, V_{*}), where V_{*} = (V_{z}^{−1} + τ^{−2}B_{θ}^{⊤}B_{θ})^{−1} and ẑ = τ^{−2}V_{*}B_{θ}^{⊤}y; only r×r systems need to be solved.
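A short numerical check of the Sherman–Woodbury–Morrison identity and the matrix determinant lemma that underlie these low-rank computations; the matrix sizes and values below are illustrative, not taken from the article's simulations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 500, 10
B = rng.standard_normal((n, r))        # n x r basis matrix B_theta
Lz = rng.standard_normal((r, r))
Vz = Lz @ Lz.T + r * np.eye(r)         # r x r covariance of z
tau2 = 0.5
Sigma = B @ Vz @ B.T + tau2 * np.eye(n)  # low-rank plus diagonal

# Sherman-Woodbury-Morrison: only an r x r system is solved.
M = np.linalg.inv(Vz) + (B.T @ B) / tau2             # r x r
Sigma_inv = np.eye(n) / tau2 - (B @ np.linalg.solve(M, B.T)) / tau2**2

# Matrix determinant lemma: log det(Sigma) from r x r pieces.
_, logdet_M = np.linalg.slogdet(M)
_, logdet_Vz = np.linalg.slogdet(Vz)
logdet_Sigma = logdet_M + logdet_Vz + n * np.log(tau2)
```

The O(n^{3}) inversion and determinant are replaced by O(nr^{2} + r^{3}) work, which is the whole point of the low-rank construction.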

Irrespective of the precise specifications, low-rank models tend to underestimate uncertainty (since they are driven by a finite number of random variables), hence, overestimate the residual variance (i.e., the nugget). Put differently, this arises from systemic over-smoothing or model under-specification by the low-rank model when compared to the parent model. For example, if w(ℓ) is the parent process and the low-rank process approximates it, then the variance of the low-rank process at any location never exceeds that of the parent process.

This phenomenon, in fact, is not dissimilar to what is seen in linear regression models and is especially transparent from writing the parent likelihood and low-rank likelihood as mixed linear models. To elucidate, suppose, without much loss of generality, that 𝒰 is a set with n locations and the parent model is y = Bz + ε with ε ~ N(0, τ^{2}I_{n}), where B is now an n×n nonsingular matrix. A low-rank specification retains only the first r columns: partition B = [B_{1} : B_{2}], where B_{1} has r columns, and set y = B_{1}z_{1} + ε. Comparing the residual sums of squares of the two models, y^{⊤}(I − P_{B_{1}})y ≥ y^{⊤}(I − P_{B})y, where P_{A} denotes the orthogonal projection onto the column space of A. Writing P_{B} = P_{B_{1}} + P_{[(I−P_{B_{1}})B_{2}]}, which is a standard result in linear model theory, we find the excess residual variability in the low-rank likelihood is summarized by y^{⊤}P_{[(I−P_{B_{1}})B_{2}]}y.

In practical data analysis, the above phenomenon is usually manifested by an over-estimation of the nugget variance as it absorbs the residual variation from the low-rank approximation. Consider the following simple experiment. We simulated a spatial dataset using the spatial regression model over randomly chosen locations ℓ_{1}, ℓ_{2}, …, ℓ_{n}, with an exponential covariance function C_{θ}(ℓ_{i}, ℓ_{j}) = σ^{2} exp(−φ||ℓ_{i} − ℓ_{j}||), spatial variance σ^{2} = 5 and nugget variance τ^{2} = 5. We then fit a sequence of low-rank radial-basis models, replacing the n×n covariance matrix with its rank-r counterpart built from r knots, for increasing numbers of knots. The 95% credible intervals for the nugget show consistent overestimation of τ^{2} because the nugget absorbs the residual variation discarded by the low-rank approximation. Even with the largest number of knots considered, the nugget remains overestimated.

Although this excess residual variability can be quantified as above (for any given value of the covariance parameters θ), correcting for it within a generic low-rank model is less straightforward; the difference between the parent covariance function and the covariance function of the low-rank process summarizes exactly what the approximation has discarded.

One particular class of low-rank processes has been especially useful in providing easy tractability to the residual process. Let 𝒰^{*} = {ℓ^{*}_{1}, ℓ^{*}_{2}, …, ℓ^{*}_{r}} be a set of r “knots” and let w^{*} be the r×1 vector of realizations of the parent process over 𝒰^{*}. The kriging interpolator at any location ℓ is E[w(ℓ) | w^{*}] = C_{θ}(ℓ, 𝒰^{*})C_{θ}(𝒰^{*}, 𝒰^{*})^{−1}w^{*}.

This single site interpolator, in fact, is a well-defined process, known as the predictive process derived from the parent process w(ℓ) and the knots 𝒰^{*}. The process is completely specified given the covariance function of the parent process and the set of knots, 𝒰^{*}. The corresponding basis functions in the low-rank representation are B_{θ}(ℓ) = C_{θ}(𝒰^{*}, 𝒰^{*})^{−1}C_{θ}(𝒰^{*}, ℓ), with z = w^{*} ~ N(0, C_{θ}(𝒰^{*}, 𝒰^{*})).

Exploiting elementary properties of conditional expectations, we obtain

Therefore, the variance of the predictive process never exceeds that of the parent process, and the residual process η(ℓ) (the difference between the parent and the predictive process) has var{η(ℓ)} = C_{θ}(ℓ, ℓ) − C_{θ}(ℓ, 𝒰^{*})C_{θ}(𝒰^{*}, 𝒰^{*})^{−1}C_{θ}(𝒰^{*}, ℓ) = δ^{2}(ℓ).

Perhaps the simplest way to remedy the bias in the predictive process is to approximate the residual process η(ℓ) by a process of independent random variables with the same pointwise variance. This yields the modified predictive process

w̃_{ε}(ℓ) = w̃(ℓ) + ε̃(ℓ), where w̃(ℓ) is the predictive process and ε̃(ℓ) ~ N(0, δ^{2}(ℓ)) independently across locations.

We present a brief simulation example revealing the benefits of the modified predictive process. We generate 2000 locations within a [0, 100] × [0, 100] square and then generate the outcomes with unit variance (σ^{2} = 1) for the spatial process and nugget variance τ^{2} = 1. We then fit the predictive process and modified predictive process models derived from the parent covariance. The results show the overestimation of τ^{2} by the predictive process and that the modified predictive process is able to adjust for the bias in τ^{2}. Not surprisingly, the RMSPE is essentially the same under either process model.

Further enhancements to the modified predictive process are possible. Since the modified predictive process adjusts only the variance, information in the covariance induced by the residual process η(ℓ) is lost. One remedy is to taper the residual covariance: approximate C_{η,θ}(ℓ, ℓ′) by the product C_{η,θ}(ℓ, ℓ′)K_{tap,ν}(ℓ, ℓ′), where K_{tap,ν}(·, ·) is a compactly supported (taper) covariance function that vanishes whenever the distance between ℓ and ℓ′ exceeds ν, so that the residual covariance matrix becomes sparse.

Perhaps the most promising use of the predictive process, at least in terms of scalability to massive spatial datasets, is the recent multiresolution approximation. The domain is partitioned into subregions ℒ_{1}, ℒ_{2}, …, ℒ_{J}. At the first resolution, the residual from a predictive process is approximated within each subregion ℒ_{j} by a process η_{1}(ℓ) such that Cov{η_{1}(ℓ), η_{1}(ℓ′)} = 0 if ℓ and ℓ′ are not in the same subregion and is equal to the covariance of the residual process otherwise.

At the second resolution, each ℒ_{j} is further partitioned into subregions ℒ_{j1}, ℒ_{j2}, …, ℒ_{jm}. One writes η_{1}(ℓ) ≈ w̃_{1}(ℓ) + η_{2}(ℓ), where w̃_{1}(ℓ) is the predictive process derived from η_{1}(ℓ) using the knots in ℒ_{j} containing ℓ, and η_{2}(ℓ) is the analogous block-independent approximation across the subregions within each ℒ_{j}: Cov{η_{2}(ℓ), η_{2}(ℓ′)} = 0 if ℓ and ℓ′ are not in the same level-2 subregion and will equal Cov{η_{1}(ℓ), η_{1}(ℓ′)} when ℓ and ℓ′ are in the same level-2 subregion. At resolution 3 we partition each of the level-2 subregions into level-3 subregions and continue the approximation of the residual process from the predictive process. At the end of M resolutions, the approximation captures variation at all scales while the computation scales roughly linearly in n, up to factors involving the number of resolutions and the (modest) number of knots per subregion.

To summarize, we do not recommend the use of unadorned low-rank processes for very large datasets; adjustments such as the modified or tapered predictive process, or multiresolution constructions, are needed to retain inferential accuracy.

A very rich and flexible class of spatial and spatiotemporal models emerges from the hierarchical linear mixed model

y = Xβ + B_{θ}z + ε, with z ~ N(0, V_{z}) and ε ~ N(0, D_{τ}), where X is the n×p matrix of predictors, B_{θ} is the n×r matrix of basis functions and D_{τ} is a diagonal covariance matrix (e.g., τ^{2}I_{n}).

Bayesian inference proceeds, customarily, by sampling {β, z, θ, τ} from the joint posterior p(β, z, θ, τ | y) ∝ p(θ, τ) × N(β | μ_{β}, V_{β}) × N(z | 0, V_{z}) × N(y | Xβ + B_{θ}z, D_{τ}). Marginalizing over z yields the collapsed likelihood N(y | Xβ, Σ_{y | θ,τ}), where Σ_{y | θ,τ} = B_{θ}V_{z}B_{θ}^{⊤} + D_{τ}. Evaluating this likelihood naively costs ~n^{3} flops, but the low-rank structure reduces it to ~nr^{2} flops using the Sherman–Woodbury–Morrison formula, so Σ_{y | θ,τ} is never formed or factorized directly.

The primary computational bottleneck lies in evaluating the multivariate Gaussian likelihood N(y | Xβ, Σ_{y}), where Σ_{y} = D_{τ} + B_{θ}V_{z}B_{θ}^{⊤}.

First compute the Cholesky factorization V_{z} = LL^{⊤} (~r^{3} flops) and form H = D_{τ}^{−1/2}B_{θ}L (~nr^{2} flops). Next compute the r×r Cholesky factorization I_{r} + H^{⊤}H = GG^{⊤} (~r^{3} flops). The log-target density for {β, θ, τ} can then be evaluated from these factors.

In particular, log det(Σ_{y}) = log det(D_{τ}) + 2Σ_{i} log G_{ii}, and quadratic forms in Σ_{y}^{−1} follow from the Sherman–Woodbury–Morrison formula, so the entire evaluation costs O(nr^{2} + r^{3}) ≈ O(nr^{2}) flops (since r ≪ n), instead of the O(n^{3}) flops that would have been required for the analogous computations in a full Gaussian process model. In practice, Gaussian proposal distributions are employed for the Metropolis algorithm and all parameters with positive support are transformed to their logarithmic scale; the necessary Jacobian adjustments are then made to the log-target density.

Starting with initial values for all parameters, each iteration of the MCMC executes the above calculations to provide a sample for {β, θ, τ}. Samples of z are then recovered from its full conditional distribution N(ẑ, V_{*}), where V_{*} = (V_{z}^{−1} + B_{θ}^{⊤}D_{τ}^{−1}B_{θ})^{−1} and ẑ = V_{*}B_{θ}^{⊤}D_{τ}^{−1}(y − Xβ).
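A minimal sketch of one such likelihood evaluation in ~nr^{2} flops, never forming the n×n matrix Σ_{y} = D_{τ} + B_{θ}V_{z}B_{θ}^{⊤}; the dimensions and the choice V_{z} = 2I are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 2000, 30
B = rng.standard_normal((n, r))           # basis matrix B_theta
L = np.linalg.cholesky(2.0 * np.eye(r))   # Cholesky of V_z (here V_z = 2I)
D = 0.5 + rng.random(n)                   # diagonal of D_tau
y = rng.standard_normal(n)

# H = D^{-1/2} B L (n x r), then I_r + H^T H = G G^T (r x r Cholesky)
H = (B / np.sqrt(D)[:, None]) @ L
G = np.linalg.cholesky(np.eye(r) + H.T @ H)

# log det(Sigma_y) = log det(D_tau) + 2 * sum_i log G_ii
logdet = np.sum(np.log(D)) + 2.0 * np.sum(np.log(np.diag(G)))

# Quadratic form via Sherman-Woodbury-Morrison, all in O(n r^2):
u = y / np.sqrt(D)
v = np.linalg.solve(G, H.T @ u)
quad = u @ u - v @ v                      # y^T Sigma_y^{-1} y
loglik = -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```

Only r×r factorizations appear, so each MCMC iteration stays cheap even for large n.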

We repeat the above computations for each iteration of the MCMC algorithm using the current values of the process parameters in Σ_{y}. For the predictive process, V_{z} = C_{θ}(𝒰^{*}, 𝒰^{*}), so each iteration additionally involves the inverse of this r×r matrix; this is O(r^{3}) and not as expensive. Drawing z ~ N_{r}(ẑ, V_{*}) then yields the latent spatial effects B_{θ}z over the observed locations.

Finally, we seek predictive inference for y(ℓ_{0}) at any arbitrary space-time coordinate ℓ_{0}. Given x(ℓ_{0}), for each posterior draw of {β, z, θ, τ} we draw y(ℓ_{0}) ~ N(x^{⊤}(ℓ_{0})β + B_{θ}^{⊤}(ℓ_{0})z, τ^{2}), where B_{θ}(ℓ_{0}) is the vector of basis functions evaluated at ℓ_{0}. Posterior predictive samples of the latent process can also be easily computed as B_{θ}^{⊤}(ℓ_{0})z for each sampled z.

Low-rank models have been, and continue to be, widely employed for analyzing spatial and spatiotemporal data. The algorithmic cost for fitting low-rank models typically decreases from O(n^{3}) to O(nr^{2} + r^{3}) ≈ O(nr^{2}) flops since r ≪ n. However, when n grows very large the number of basis functions needed for acceptable accuracy also grows, and the ~nr^{2} flops become exorbitant. Furthermore, low-rank models can perform poorly depending upon the smoothness of the underlying process or when neighboring observations are strongly correlated and the spatial signal dominates the noise (

As an example, consider part of the simulation experiment presented in

For datasets with ~10^{4} to ~10^{6} locations, low-rank models may struggle to deliver acceptable inference. In this regard, enhancements such as the multi-resolution predictive process approximations referred to in Section 2.2 are highly promising.

An alternative is to develop full rank models that can exploit sparsity. Instead of deriving basis approximations for w(ℓ), one constructs a tapered covariance function C_{θ}(ℓ, ℓ′)K_{tap,ν}(ℓ, ℓ′), where K_{tap,ν}(·, ·) is a compactly supported covariance function that vanishes whenever the distance between ℓ and ℓ′ exceeds ν. The resulting covariance matrix is sparse, and sparse Cholesky algorithms deliver likelihood evaluations for datasets with n ~ 10^{5} or more.
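A minimal sketch of covariance tapering, using the spherical correlation as the compactly supported taper (a valid covariance in the plane); by the Schur product theorem the elementwise product of two covariance matrices is again a covariance matrix. Domain size, range ν and seed are illustrative.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(4)
n = 1000
U = rng.uniform(0, 100, size=(n, 2))
d = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
C = np.exp(-0.1 * d)  # exponential parent covariance

# Spherical taper: compactly supported, vanishes beyond range nu, so the
# elementwise (Schur) product C * taper is a sparse covariance matrix.
nu = 10.0
x = np.minimum(d / nu, 1.0)
taper = (1.0 - x) ** 2 * (1.0 + x / 2.0) * (d < nu)
C_tap = sparse.csr_matrix(C * taper)

fill = C_tap.nnz / n**2  # fraction of nonzero entries retained
```

Sparse Cholesky routines then operate on `C_tap`, whose fill fraction, rather than n^{2}, drives the cost.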

Another way to exploit sparsity is to model the inverse of var{w}, i.e., the precision matrix, directly as a sparse matrix, as is done with Gaussian Markov random field approximations to Gaussian processes.

Rather than working with approximations to the process, one could also construct massively scalable sparsity-inducing Gaussian processes that can be conveniently embedded within Bayesian hierarchical models and deliver full Bayesian inference for random fields at arbitrary resolutions. Section 3.1 describes how sparsity is introduced in the precision matrices for graphical Gaussian models by exploiting the relationship between the Cholesky decomposition of a positive definite matrix and conditional independence. These sparse Gaussian models (i.e., normal distributions with sparse precision matrices) can be used as prior models for a finite number of spatial random effects. Section 3.2 shows the construction of a process from these graphical Gaussian models. This process will be a Gaussian process whose finite-dimensional realizations will have sparse precision matrices. We call them Nearest Neighbor Gaussian Processes (NNGP). Finally, Section 3.3 outlines how the process can be embedded within hierarchical models and presents some brief simulation examples demonstrating certain aspects of inference from NNGP models.

Consider the hierarchical model of Section 1, where w ~ N(0, C_{θ}(𝒰, 𝒰)) and likelihood evaluation requires factorizing the dense n×n matrix C_{θ}(𝒰, 𝒰). The joint density of w can always be written as a product of conditional densities: p(w) = p(w_{1}) p(w_{2} | w_{1}) ⋯ p(w_{n} | w_{1}, …, w_{n−1}).

The underlying idea is, in fact, ubiquitous in graphical models or Bayesian networks (see, e.g., the standard references on probabilistic graphical models): one shrinks the conditioning sets, replacing {w_{1}, …, w_{i−1}} with a much smaller subset N(i) ⊂ {w_{1}, …, w_{i−1}} containing at most m elements, so that p(w) ≈ Π_{i=1}^{n} p(w_{i} | w_{N(i)}).

The above model posits that any node w_{i} is conditionally independent of the remaining preceding variables given its parents w_{N(i)}; the resulting factorization corresponds to a directed acyclic graph (DAG) and is always a proper joint density.

Applying the above notion to multivariate Gaussian densities evinces the connection between conditional independence in DAGs and sparsity. Consider an n×1 vector w ~ N(0, C_{θ}), where C_{θ} is a dense n×n covariance matrix.

We can write w_{1} = η_{1} and w_{i} = Σ_{j<i} a_{ij}w_{j} + η_{i} for i = 2, …, n, where the η_{i} are independent, d_{11} = var{w_{1}} and d_{ii} = var{w_{i} | w_{j}, j < i}. Collecting the a_{ij} into a strictly lower-triangular matrix A and the d_{ii} into a diagonal matrix D gives w = Aw + η, i.e., C_{θ} = (I − A)^{−1}D(I − A)^{−⊤}.

From the structure of A and D we obtain C_{θ}^{−1} = (I − A)^{⊤}D^{−1}(I − A). The possibly nonzero elements of each row of A, and the entries of D, are obtained from C_{θ} as follows.

Here C_{θ}[S, S′] denotes the submatrix of C_{θ} with rows indexed by S and columns by S′:

    d[1] = C_{θ}[1, 1]
    for i = 2, 3, …, n:
        A[i, 1:(i−1)] = solve( C_{θ}[1:(i−1), 1:(i−1)], C_{θ}[1:(i−1), i] )
        d[i] = C_{θ}[i, i] − C_{θ}[i, 1:(i−1)] A[i, 1:(i−1)]

The above pseudocode provides a way to obtain the Cholesky decomposition of C_{θ}: if C_{θ} = LL^{⊤} is the Cholesky decomposition, then L = (I − A)^{−1}D^{1/2}. There is, however, no apparent gain to be had from the preceding computations since one will need to solve increasingly larger linear systems as the loop runs into higher values of i.

If, instead, each conditioning set is restricted to at most m elements, each iteration of the loop solves a linear system of size at most m, requiring at most ~m^{3} flops and ~nm^{3} flops in total, whereas the earlier pseudocode in its full form entails ~n^{3} flops. These computations can be performed in parallel as each iteration of the loop is independent of the others.
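The restricted-conditioning construction can be sketched directly; a minimal Python version, assuming an exponential covariance and conditioning sets formed from the m nearest preceding locations (all settings illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 300, 10
U = rng.uniform(0, 10, size=(n, 2))
dist = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
C = np.exp(-dist)  # parent covariance C_theta

A = np.zeros((n, n))      # strictly lower triangular, <= m nonzeros per row
Dvec = np.empty(n)        # diagonal of D
Dvec[0] = C[0, 0]
for i in range(1, n):
    # conditioning set: at most m nearest preceding locations
    prev = np.argsort(dist[i, :i])[:m]
    a = np.linalg.solve(C[np.ix_(prev, prev)], C[prev, i])
    A[i, prev] = a
    Dvec[i] = C[i, i] - C[i, prev] @ a   # conditional variance

# Sparse precision: C_tilde^{-1} = (I - A)^T D^{-1} (I - A)
IA = np.eye(n) - A
Q = IA.T @ (IA / Dvec[:, None])
```

Each loop iteration solves only an m×m system, and the iterations are mutually independent, so they parallelize trivially.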

The above discussion provides a very useful strategy for introducing sparsity in a precision matrix. Let A be strictly lower triangular with at most m ≪ n nonzero entries per row, obtained from the restricted conditioning sets, and let D be the corresponding diagonal matrix. Then C̃_{θ} = (I − A)^{−1}D(I − A)^{−⊤} is a covariance matrix whose inverse C̃_{θ}^{−1} = (I − A)^{⊤}D^{−1}(I − A) is sparse. Therefore, one way to achieve massive scalability for the hierarchical models above is to replace the dense C_{θ} with such a C̃_{θ}.

If we are interested in estimating the spatial or spatiotemporal process parameters from a finite collection of random variables, then we can use the approach in Section 3.1 with the conditioning set for w(ℓ_{i}) chosen as the m nearest neighbors of ℓ_{i} among the preceding locations ℓ_{1}, …, ℓ_{i−1}. Since covariance functions typically decay with distance, the information about w(ℓ_{i}) contained in the preceding variables resides mostly in its nearest neighbors, so little is lost by discarding the rest of the conditioning set.

Localized Gaussian process regression based on a few nearest neighbors has also been used to obtain fast kriging estimates.

If, however, posterior predictive inference is sought at arbitrary spatiotemporal resolutions, i.e., for the entire process {w(ℓ) : ℓ ∈ ℒ}, then the finite-dimensional approximation above must be extended to a well-defined stochastic process.

We will construct the NNGP in two steps. First, we specify a multivariate Gaussian distribution over a fixed finite set ℛ = {ℓ_{1}, ℓ_{2}, …, ℓ_{k}}, called the reference set, by applying the nearest-neighbor approximation of Section 3.1: w_{ℛ} ~ N(0, C̃_{θ}), where w_{ℛ} is the k×1 vector of process realizations over ℛ and C̃_{θ} = (I − A)^{−1}D(I − A)^{−⊤} is built from the parent covariance function C_{θ} using at most m nearest neighbors per location.

If m ≪ k, then C̃_{θ}^{−1} has at most O(km^{2}) non-zero elements, and determinants and quadratic forms involving C̃_{θ} are available in ~km^{3} flops. Hence, the approximation yields massive computational savings over the dense C_{θ}.
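The approximate density over the reference set is just a product of univariate Gaussian conditionals, each using at most m nearest preceding locations; a minimal sketch comparing it against the dense O(k^{3}) log-density, with an illustrative exponential covariance and seed.

```python
import numpy as np

rng = np.random.default_rng(6)
k, m = 400, 10
R = rng.uniform(0, 10, size=(k, 2))   # reference set
dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
C = np.exp(-dist)
C[np.diag_indices(k)] += 1e-8         # jitter for numerical stability
w = np.linalg.cholesky(C) @ rng.standard_normal(k)

# Nearest-neighbor log-density: ~k m^3 flops in total.
ll = -0.5 * (np.log(2 * np.pi * C[0, 0]) + w[0] ** 2 / C[0, 0])
for i in range(1, k):
    nb = np.argsort(dist[i, :i])[:m]  # m nearest preceding locations
    b = np.linalg.solve(C[np.ix_(nb, nb)], C[nb, i])
    mu = b @ w[nb]                    # conditional mean
    d = C[i, i] - C[i, nb] @ b        # conditional variance
    ll += -0.5 * (np.log(2 * np.pi * d) + (w[i] - mu) ** 2 / d)

# Dense full-GP log-density for comparison (O(k^3)):
cf = np.linalg.cholesky(C)
alpha = np.linalg.solve(cf, w)
ll_full = -0.5 * (k * np.log(2 * np.pi)
                  + 2 * np.sum(np.log(np.diag(cf))) + alpha @ alpha)
```

With even modest m the two log-densities are typically very close, which is what makes the approximation attractive.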

To construct the NNGP we extend the above model to arbitrary locations. For every ℓ ∉ ℛ we define the neighbor set N(ℓ) ⊂ ℛ to be the m nearest neighbors of ℓ within the reference set and, given w_{ℛ}, specify

w(ℓ) | w_{ℛ} ~ N(b_{θ}^{⊤}(ℓ)w_{N(ℓ)}, σ^{2}(ℓ)), independently across ℓ ∉ ℛ,

where b_{θ}(ℓ) = C_{θ}(N(ℓ), N(ℓ))^{−1}C_{θ}(N(ℓ), ℓ) and σ^{2}(ℓ) = C_{θ}(ℓ, ℓ) − C_{θ}(ℓ, N(ℓ))b_{θ}(ℓ).

Taking conditional expectations, E[w(ℓ) | w_{ℛ}] = b_{θ}^{⊤}(ℓ)w_{N(ℓ)}, so the NNGP behaves as a local kriging interpolator that uses only the m nearest reference locations.

One point worth considering is the definition of “neighbors.” There is some flexibility here. In the spatial setting, the correlation functions usually decay with increasing inter-site distance, so the set of nearest neighbors based on the inter-site distances represents the locations exhibiting the highest correlation with a given location. For example, on the plane one could simply use the Euclidean metric to construct neighbor sets, although

In spatiotemporal settings, matters are more complicated. Spatiotemporal covariances between two points typically depend on the spatial as well as the temporal lag between the points. Non-separable isotropic spatiotemporal covariance functions can be written as C_{θ}((s_{1}, t_{1}), (s_{2}, t_{2})) = C_{θ}(h, u), where h = ||s_{1} − s_{2}|| and u = |t_{1} − t_{2}|. It is not obvious how to combine h and u into a single metric d : ℒ^{2} → ℜ^{+} such that C_{θ} will be monotonic with respect to it. Instead, one can use the covariance function itself to define proximity: for any three points ℓ_{1}, ℓ_{2} and ℓ_{3}, we say that ℓ_{1} is nearer to ℓ_{2} than to ℓ_{3} if C_{θ}(ℓ_{1}, ℓ_{2}) > C_{θ}(ℓ_{1}, ℓ_{3}). Subsequently, this definition of “distance” is used to find the m nearest neighbors of each location.
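Covariance-based neighbor selection can be sketched in a few lines; the Gneiting-type nonseparable covariance below is used purely for illustration, as are the domain sizes and seed.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
S = rng.uniform(0, 10, size=(n, 2))   # spatial coordinates
T = rng.uniform(0, 5, size=n)         # time points

def st_cov(h, u, sigma2=1.0, a=1.0, c=1.0):
    """A simple nonseparable space-time covariance (Gneiting-type form,
    used here purely for illustration)."""
    denom = a * u**2 + 1.0
    return sigma2 / denom * np.exp(-c * h / np.sqrt(denom))

h = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)  # spatial lags
u = np.abs(T[:, None] - T[None, :])                         # temporal lags
C = st_cov(h, u)

# "Nearest" neighbors of point i among points 0..i-1 are those with the
# HIGHEST covariance with i, not the smallest Euclidean distance.
m, i = 10, 150
neighbors = np.argsort(-C[i, :i])[:m]
```

Replacing the metric with the covariance itself sidesteps the question of how to weight spatial against temporal lags.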

However, the covariance-based distances for every point ℓ_{i} depend on the current value of θ, so, strictly speaking, the neighbor sets change as θ is updated within the MCMC algorithm; recomputing them remains feasible because only pairwise evaluations of C_{θ} are required.

We briefly turn to model fitting and estimation using the NNGP.

Since the NNGP is a proper Gaussian process, we can use it as a prior for the spatial random effects in any hierarchical model. We write w(ℓ) ~ NNGP(0, C_{θ}(·, ·)) and embed it within the hierarchical model with nugget variance τ^{2}. The full conditional distribution of each element of w_{ℛ} involves only small m×m matrices, so Gibbs sampling scales linearly in the size of the reference set ℛ. Prediction at any arbitrary location ℓ ∉ ℛ is performed by sampling from the posterior predictive distribution: for each posterior draw of {β, θ, τ, w_{ℛ}}, we draw w(ℓ) ~ N(b_{θ}^{⊤}(ℓ)w_{N(ℓ)}, σ^{2}(ℓ)) and then y(ℓ) ~ N(x^{⊤}(ℓ)β + w(ℓ), τ^{2}).
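The prediction step for a single new location can be sketched as follows; the reference set, covariance and "posterior draw" of w_{ℛ} below are simulated stand-ins, purely to illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(8)
k, m = 300, 10
R = rng.uniform(0, 10, size=(k, 2))   # reference set
C = np.exp(-np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1))
C[np.diag_indices(k)] += 1e-8
w_R = np.linalg.cholesky(C) @ rng.standard_normal(k)  # stand-in draw of w_R

# Predict w(l0) at a new location l0 from its m nearest reference points.
l0 = np.array([5.0, 5.0])
d0 = np.linalg.norm(R - l0, axis=1)
N0 = np.argsort(d0)[:m]               # neighbor set N(l0)
c0 = np.exp(-d0[N0])                  # C_theta(N(l0), l0)
b = np.linalg.solve(C[np.ix_(N0, N0)], c0)
mean0 = b @ w_R[N0]                   # b_theta(l0)^T w_{N(l0)}
var0 = 1.0 - c0 @ b                   # C(l0, l0) - c0^T b
w0 = mean0 + np.sqrt(var0) * rng.standard_normal()
```

In a full MCMC run this step would be repeated for each posterior draw, yielding a posterior predictive sample at ℓ_{0} at O(m^{3}) cost per draw.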

Another, even simpler, example models a continuous outcome itself as an NNGP. Let the desired full GP specification be y(ℓ) ~ GP(x^{⊤}(ℓ)β, C_{θ}(·, ·)); one simply replaces the dense C_{θ} with its NNGP-induced counterpart C̃_{θ}.

The above model is extremely fast. The likelihood is of the form N(y | Xβ, C̃_{θ}), where C̃_{θ} is now derived from a covariance function that absorbs the nugget, e.g., σ^{2}R_{ϕ} + τ^{2}I_{n}, where R_{ϕ} is a spatial correlation matrix and θ = {σ^{2}, ϕ, τ^{2}}. These parameters also feature in the derived NNGP covariance matrix C̃_{θ}. Prediction at a new location ℓ proceeds directly from the conditional normal distribution N(x^{⊤}(ℓ)β + b_{θ}^{⊤}(ℓ)(y_{N(ℓ)} − X_{N(ℓ)}β), σ^{2}(ℓ)). Note, however, that there is no latent smooth process w(ℓ) to recover in this formulation.

Likelihood computations in NNGP models usually involve ~nm^{3} flops and ~nm^{2} storage; one never needs to store an n×n matrix. Substantial computational savings accrue because m is typically very small relative to n, and NNGP models have been fitted to datasets with ~10^{5} spatial locations or more. For example,

Another important point to note is that C̃_{θ} depends on the ordering of the locations ℓ_{1}, ℓ_{2}, …, ℓ_{n} used to build the neighbor sets, since the underlying DAG does. To assess this, we simulated a dataset with covariance function C_{θ}(ℓ_{i}, ℓ_{j}) = σ^{2} exp(−ϕ||ℓ_{i} − ℓ_{j}||) plus a nugget τ^{2}, with true parameter values {σ^{2}, ϕ, τ^{2}} as given in the accompanying table, and fit NNGP models under four different topological orderings of the locations. The posterior estimates, the Kullback–Leibler divergence from the full GP and the RMSPE are all very similar across orderings, suggesting that inference is robust to the choice of ordering.

The article has attempted to provide some insight into constructing highly scalable Bayesian hierarchical models for very large spatiotemporal datasets using low-rank and sparsity-inducing processes. Such models are increasingly being employed to answer complex scientific questions and analyze massive spatiotemporal datasets in the natural and environmental sciences. Any standard Bayesian estimation algorithm, such as Markov chain Monte Carlo or Hamiltonian Monte Carlo (see, e.g.,
), can, in principle, be adapted to fit these models to datasets with ~10^{6} spatial and/or temporal points without sacrificing richness in the model.

While the NNGP certainly seems to have an edge in scalability over the more conventional low-rank or fixed rank models, it is premature to say whether its inferential performance will always excel over low-rank or fixed rank models. For example, analyzing complex nonstationary random fields may pose challenges regarding construction of neighbor sets, as a simple distance-based definition of neighbors may prove to be inadequate. Multiresolution basis functions may be more adept at capturing nonstationarity, but may struggle with massive datasets. Dynamic neighbor selection for nonstationary fields, where neighbors will be chosen based upon the covariance kernel itself, analogous to

There remain other challenges in high-dimensional geostatistics. Here, we have considered geostatistical settings with very large numbers of locations and/or time-points, but restricted our discussion to univariate outcomes. In practice, we often observe a q×1 vector of outcomes at each location, where q itself can be large (say, of the order 10^{3}) over a large collection of spatial sites (again, say, order 10^{3}).

The linear model of coregionalization (LMC) proposed by

Spatial factor models (see, e.g.,

Computational developments with regard to Markov chain Monte Carlo (MCMC) algorithms (see, e.g.,

In terms of the hierarchical geostatistical models presented in this article,

The author wishes to thank the Editor-in-Chief (Professor Bruno Sansó) and the anonymous reviewers for very constructive and insightful feedback. In addition, the author also wishes to thank Dr. Abhirup Datta, Dr. Andrew O. Finley and Ms. Lu Zhang for useful discussions. The work of the author was supported in part by NSF DMS-1513654 and NSF IIS-1562303.

95% credible intervals for the nugget for 40 different low-rank radial-basis models with knots varying between 5 and 200 in steps of 5. The horizontal line at τ^{2} = 5 denotes the true value of τ^{2} with which the data was simulated.

Comparing estimates of a simulated random field using a full Gaussian Process (Full GP) and a Gaussian Predictive process (PPGP) with 64 knots. The oversmoothing by the low-rank predictive process is evident.

Sparsity using directed acyclic graphs.

Structure of the factors making up the sparse precision matrix C̃_{θ}^{−1} = (I − A)^{⊤}D^{−1}(I − A).

95% credible intervals for the effective spatial range from an NNGP model with varying numbers of nearest neighbors m.

Parameter estimates for the predictive process (PP) and modified predictive process (MPP) models in the univariate simulation.

| σ^{2} | ϕ | τ^{2} | RMSPE
---|---|---|---|---
True | 1 | 1 | 1 |
PP | 1.37 (0.29, 2.61) | 1.37 (0.65, 2.37) | 1.18 (1.07, 1.23) | 1.21
MPP | 1.36 (0.51, 2.39) | 1.04 (0.52, 1.92) | 0.94 (0.68, 1.14) | 1.20
PP | 1.36 (0.52, 2.32) | 1.39 (0.76, 2.44) | 1.09 (0.96, 1.24) | 1.17
MPP | 1.33 (0.50, 2.24) | 1.14 (0.64, 1.78) | 0.93 (0.76, 1.22) | 1.17
PP | 1.31 (0.23, 2.55) | 1.12 (0.85, 1.58) | 0.99 (0.85, 1.16) | 1.17
MPP | 1.31 (0.23, 2.63) | 1.04 (0.76, 1.49) | 0.98 (0.87, 1.21) | 1.17

Posterior parameter estimates, the Kullback–Leibler divergence (KL-D) and root mean square predictive errors (RMSPE) are presented for four NNGP models constructed from different topological orderings. The four orderings, from left to right, are “sorted on the sum of vertical and horizontal coordinates”, maximum-minimum distance (MMD), “sorted on the horizontal coordinate” and “sorted on the vertical coordinate”.

NNGP from different topological orders

| True | Sorted coord (x+y) | MMD | Sorted x | Sorted y
---|---|---|---|---|---
σ^{2} | 1 | 0.79 (0.69, 1.04) | 0.80 (0.69, 1.02) | 0.80 (0.70, 1.05) | 0.83 (0.69, 1.08)
τ^{2} | 0.45 | 0.45 (0.44, 0.46) | 0.45 (0.44, 0.47) | 0.45 (0.44, 0.46) | 0.45 (0.44, 0.47)
ϕ | 5 | 8.11 (4.42, 11.10) | 7.63 (4.58, 10.97) | 8.01 (4.26, 11.18) | 7.12 (4.06, 11.03)
KL-D | – | 24.04022 | 13.88847 | 22.30667 | 21.59174
RMSPE | – | 0.5278996 | 0.5278198 | 0.527912 | 0.527807