In environmental health research and practice, spatial analysis has become an important approach to estimating the environmental exposure of the human subjects of concern. A typical situation in this kind of application is that pollution data are available only at certain locations, and thus inference is needed to convert a limited number of values at discrete locations into a continuous surface. This paper intends to clarify the distinction among three methods that can be used to achieve this conversion, namely interpolation, kernel density estimation (KDE), and

In environmental health research and practice, spatial analysis has become an important approach to estimating the environmental exposure of the human subjects of concern. The underlying assumption of this approach is that the concentration of a certain pollutant at a human subject’s location (e.g., residential location) can be taken as a substitute for the subject’s actual exposure to that pollutant in an epidemiological analysis. A typical situation that the researcher deals with in this kind of application is that pollution data are available only at particular locations, and thus some inference is needed to estimate concentrations at the human subjects’ locations. Such an inference usually takes the form of converting a limited number of values at discrete locations into a continuous surface. Commonly employed methods to achieve such a conversion include interpolation and kernel density estimation (KDE).

Interpolation has been widely used in environmental health studies to map contaminations of soil (e.g.,

Because all three can generate continuous surfaces in the raster format from discrete representations, typically points, they are easily confused and misused. This paper intends to clarify the distinction among the three methods, without overusing statistical concepts and terminology, and to suggest appropriate applications for each.

For convenience of description, we only discuss situations in which the input data are points. In real-world exposure estimations based on any of the three analyses, lines and polygons are usually first discretized into points, and thus the discussion here applies directly to situations with lines and polygons. Also for convenience of description, in many places in this paper we imply negative physical environmental factors by using words such as contamination and pollution, but the same ideas and processes certainly apply to other types of environmental factors, including physical, socioeconomic, and behavioural attributes. Many of these can be positive for human health, e.g., infrastructure that improves walkability, healthy food outlets, recreational facilities, and greenspaces.

In this paper, to simplify the description we use the following general form to represent interpolation:

$$z_o = \sum_{i=1}^{n} w_i z_i$$

where $z_o$ is the value at a given un-sampled location $o$; $z_i$ is the value of sample $i$; and $w_i$ is the weight of sample $i$. The weight $w_i$ is greatly related to the distance between locations $o$ and $i$: in a simple inverse-distance weighting (IDW) process, $w_i$ is entirely determined by that distance; in a sophisticated kriging process, $w_i$ is determined by a sample-driven quantification of spatial autocorrelation, which takes in both spatial and attribute information. Since it is a weighted average calculation, it is constrained by:

$$\sum_{i=1}^{n} w_i = 1$$

This constraint indicates that the weight of a sample point is not determined by itself, but by its relationships with other samples. The result of a simple IDW interpolation that employs a linear distance decay function can be illustrative about its weighted-averaging nature (

The result of a kriging interpolation may not have the original sample values right on the interpolated surface. This is because kriging uses a derived global model to determine the weight of each sample. Nevertheless, the goal of the derivation is to make the model best fit all the samples (or certain derived characteristics of them, in this case the semivariogram). This is analogous to a regression line, which may not pass through every sample value even though the goal is still to fit the line to the samples as closely as possible.
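The weighted-averaging nature of interpolation described above can be sketched for the simple IDW case. This is a minimal illustration only: the function names, the power-law decay with exponent 2, and the sample coordinates and values are assumptions made for the example, not the settings used in this paper.

```python
import math


def idw_weights(points, target, power=2.0):
    """Return normalised inverse-distance weights of the samples for `target`."""
    dists = [math.hypot(x - target[0], y - target[1]) for x, y in points]
    if 0.0 in dists:
        # The target coincides with a sample, so that sample takes all the weight.
        hit = dists.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(points))]
    raw = [1.0 / d ** power for d in dists]   # assumed decay: 1 / d^power
    total = sum(raw)
    # Each weight depends on all samples through the normalising total.
    return [w / total for w in raw]


def idw_estimate(points, values, target, power=2.0):
    """Weighted average of the sample values at an un-sampled location."""
    weights = idw_weights(points, target, power)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical samples: three monitoring locations with measured values.
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
values = [12.0, 30.0, 18.0]

weights = idw_weights(points, (4.0, 3.0))
estimate = idw_estimate(points, values, (4.0, 3.0))
```

Because the weights are normalised, the estimate always lies between the smallest and largest sample values, and the surface returns the sample value itself at a sample location.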

When estimating the environmental exposure, interpolation should be used when the available data are indeed from a limited number of sample locations. The term

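In non-parametric statistics, the observed points in kernel density estimation (KDE) are considered realizations of a random variable whose probability density is unknown, and the density at a location $o$ is estimated as:

$$\hat{f}(o) = \frac{1}{n}\sum_{i=1}^{n} K(d_{i,o})$$

where $n$ is the number of observed points, $d_{i,o}$ is the distance between point $i$ and location $o$, and $K$ is a kernel function of $d_{i,o}$. To ensure that $K$ is a probability density, it must have the normalization feature:

$$\int K(d_{i,o})\,\mathrm{d}o = 1$$

with the integral taken over all locations $o$.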

When the purpose is to estimate probability density at a location,

Within a physical context, when the points represent locations of sources, the estimation takes the form:

$$z_o = \sum_{i=1}^{n} K(d_{i,o})\, z_i$$

where $z_o$ is the estimated concentration (exposure) value at a given location $o$; $z_i$ is the value of source point $i$

To ensure that the KDE represented by

Another important difference between the interpolation and the KDE-based spreading process is that the points in the latter are not samples, but a complete set of the points that should be taken into account, e.g., all pollution sites in an area that should be included in the analysis.
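The contrast with interpolation can be made concrete with a one-dimensional sketch of a KDE-based spreading process. It assumes an Epanechnikov kernel scaled to integrate to 1, hypothetical source positions and quantities, and a hypothetical bandwidth; none of these values come from the case study.

```python
def epanechnikov(d, bandwidth):
    """1-D Epanechnikov kernel, scaled so that it integrates to 1."""
    u = d / bandwidth
    if abs(u) >= 1.0:
        return 0.0                      # no contribution beyond the bandwidth
    return 0.75 * (1.0 - u * u) / bandwidth


def spread(sources, location, bandwidth):
    """Sum (not average) of the kernel-spread source quantities at `location`."""
    return sum(q * epanechnikov(location - x, bandwidth) for x, q in sources)


# Hypothetical complete set of sources: (position, total quantity released).
sources = [(2.0, 5.0), (3.0, 8.0), (7.0, 2.0)]
bandwidth = 1.5

# Evaluate the surface on a fine grid over [0, 10] and integrate it numerically.
n_cells = 10000
step = 10.0 / n_cells
grid = [i * step for i in range(n_cells + 1)]
surface = [spread(sources, x, bandwidth) for x in grid]
total_on_surface = sum(surface) * step
```

In this sketch the integrated surface recovers the total quantity of the sources (15 units), contributions from nearby sources add up rather than being averaged, and locations farther than one bandwidth from every source receive exactly zero. An interpolated surface has none of these properties.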

Here we present a case study of estimating the spatial distribution of pesticide pollution in New Hampshire (NH), US, in which the kernel density is calculated from areal rather than point data. From the NH Department of Environmental Services, we acquired data that comprehensively describe the application history of various pesticides on all NH farms during 1965–1994. Each farm is represented by one or multiple polygons in the provided Shapefile. A separate Excel sheet lists details of the pesticide applications, including the quantity of a particular pesticide applied to a farm, the date of the application, and the acreage of the application. Here we use Maneb (a pesticide) in a single year as an example to describe the estimation process. The same process was applied to each pesticide in each year. We aggregated the estimates for individual applications by year because the estimates (environmental exposure) need to be matched with data on patients’ migration histories, which are organized by year, in an environmental health study of a certain disease.

KDE requires the input data to be points. Thus the general idea in this analysis is to first convert each polygon into an agglomeration of points, then attach the application quantity of Maneb to those points, and finally use the value points generated in this way to calculate the kernel density, which is an estimate of the quantity of Maneb at each and every location in NH. The analysis encountered a number of challenges arising from the complexity of the data. One challenge is that an application is recorded for a farm, but a farm may have multiple fields (polygons), and there is no information about which field(s) the pesticide was applied to. Another is that the recorded acreage of an application is often much smaller than the area of a polygon, and there is no information about which portion(s) of the polygon received the pesticide. To compile the data into a set of reasonable points, each attached with a reasonable quantity of applied Maneb, so that the KDE can be performed, we went through a four-step process described as follows, and illustrated by

First, if the farm has multiple fields, then without further information we assumed that the quantity of Maneb, as well as the acreage, of a single application had been allocated to those fields in proportion to their areas (implying that an application area is homogeneous in terms of the concentration of Maneb). Based on this assumption, we calculated the quantity and acreage of an application for each field polygon.
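The proportional allocation in this first step can be sketched as follows. The function name, the field labels, and all numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def allocate_by_area(field_areas, quantity, acreage):
    """Split one application's quantity and acreage among a farm's fields
    in proportion to the field areas.

    field_areas: dict mapping a field label to its polygon area.
    Returns a dict mapping each label to (allocated quantity, allocated acreage).
    """
    total_area = sum(field_areas.values())
    return {
        field: (quantity * area / total_area, acreage * area / total_area)
        for field, area in field_areas.items()
    }


# Hypothetical farm with two fields whose areas are 30 and 10 units.
fields = {"A1": 30.0, "A2": 10.0}
shares = allocate_by_area(fields, quantity=200.0, acreage=16.0)
```

With these numbers, field A1 receives three quarters of both the quantity and the acreage, so the quantity per unit of applied area is the same in both fields, which is what the homogeneity assumption requires.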

Second, considering the spatial concentration of an application (i.e., a small acreage of application in a relatively large field is unlikely to be evenly scattered across the entire field), and also balancing precision against computing burden, we chose to use one point to represent an area of 4 hectares. With this setting, we calculated the number of points that should be used to represent a polygon, based on the acreage of the application that was allocated to the polygon.

Third, we generated the calculated number of random points within each polygon. The total quantity of Maneb applied to the polygon was equally divided among the points, and the resulting value was attached to each point.
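The second and third steps can be sketched together. The sketch assumes projected coordinates in metres, rounds the point count to the nearest integer with a minimum of one point (the rounding rule is an assumption), and uses rejection sampling with a ray-casting test to place random points inside a polygon. All names and numbers are hypothetical.

```python
import random

AREA_PER_POINT_HA = 4.0   # one point represents 4 hectares, as stated in the text


def point_count(applied_area_ha):
    """Number of points for a polygon; nearest-integer rounding is assumed."""
    return max(1, round(applied_area_ha / AREA_PER_POINT_HA))


def inside(polygon, x, y):
    """Ray-casting test: is (x, y) inside the polygon given as a vertex list?"""
    hit = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                hit = not hit
    return hit


def random_value_points(polygon, applied_area_ha, quantity, rng):
    """Generate random points in the polygon, each carrying an equal share."""
    n = point_count(applied_area_ha)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < n:
        # Draw from the bounding box and keep only draws inside the polygon.
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if inside(polygon, x, y):
            points.append((x, y, quantity / n))
    return points


# Hypothetical L-shaped field (metres) with 12 ha of application and 150 units.
field = [(0, 0), (400, 0), (400, 200), (200, 200), (200, 400), (0, 400)]
value_points = random_value_points(field, 12.0, 150.0, random.Random(42))
```

Here 12 ha of application yields three points of 50 units each. Because the point count is an integer, the value per point can differ slightly between polygons that have the same quantity per unit area.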

Fourth, using the generated value points, we created raster layers of the Maneb distribution through KDE. We tested two bandwidths, 500 m and 1,000 m, representing how far the Maneb from a source could reach.
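The fourth step can be sketched as a raster evaluation of the value points under the two bandwidths. The two-dimensional Epanechnikov kernel, the 50 m cell size, the raster extent, and the three value points are assumptions for illustration; the kernel and raster settings of the actual analysis are not restated here.

```python
import math


def kernel_2d(d, bandwidth):
    """2-D Epanechnikov kernel, scaled to integrate to 1 over the plane."""
    if d >= bandwidth:
        return 0.0
    return 2.0 / (math.pi * bandwidth ** 2) * (1.0 - (d / bandwidth) ** 2)


def kde_raster(value_points, bandwidth, xmin, ymin, ncols, nrows, cell):
    """Evaluate the value-weighted kernel sum at every cell centre."""
    raster = []
    for r in range(nrows):
        cy = ymin + (r + 0.5) * cell
        row = []
        for c in range(ncols):
            cx = xmin + (c + 0.5) * cell
            row.append(sum(v * kernel_2d(math.hypot(cx - x, cy - y), bandwidth)
                           for x, y, v in value_points))
        raster.append(row)
    return raster


# Hypothetical value points: (x, y, quantity), coordinates in metres.
value_points = [(1500.0, 1500.0, 50.0), (1700.0, 1600.0, 50.0),
                (2500.0, 2400.0, 30.0)]
cell = 50.0
rasters = {bw: kde_raster(value_points, bw, 0.0, 0.0, 80, 80, cell)
           for bw in (500.0, 1000.0)}

# Quantity represented by each raster: density times cell area, summed.
totals = {bw: sum(map(sum, r)) * cell ** 2 for bw, r in rasters.items()}
peaks = {bw: max(map(max, r)) for bw, r in rasters.items()}
```

Both rasters represent approximately the same total quantity (130 units in this example). The 500 m bandwidth concentrates it into higher peaks near the sources, while the 1,000 m bandwidth spreads it over a wider area.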

The KDE represented by _{2.5} concentration measured at the location of a factory chimney at a certain time of a day. Noteworthy, while such data can also be called

These temporal samples at the sources give a snapshot of the sources during the pollutant spreading process. To estimate the environmental exposure at every location, we want to use the snapshot of the sources to infer a snapshot of the entire area. In this paper, we propose to use the term

In contrast to the KDE represented by

A fundamental feature of snapshotting that is not fully implied by

In _{i} denotes the kernel function applied to source

_{1}, _{2},…, _{n}; and denote _{i,j}, where _{1}, _{2}, …, _{n} can be solved as:

The sources’ own values (_{1}, _{2},…, _{n}) calculated with

A rule of thumb we propose for selecting among the three methods: If we know that the pollution data are spatially about the source(s), either point sources or non-point sources, KDE or

Stemming from that rule, a requirement to the data for KDE and

Some studies develop regression models to infer pollution (exposure) values at un-sampled locations based on the measured values at sample locations (serving as the dependent in the model) and data of related environmental factors (serving as the independents in the model) (e.g.,

In both interpolation and

The essential difference between interpolation and

The choice of different kernel function (

The meanings and intentions of the three methods are fundamentally distinctive, which is summarized in

Xun Shi is a Professor of Geography at Dartmouth College, USA. His primary research interest is in spatial analysis and its application in human health studies.

Meifang Li is a Ph.D. candidate at the School of Geography and Planning, Sun Yat-sen University, China. She is currently a visiting scholar at the Geography Department, Dartmouth College. Her primary research interest is in spatiotemporal modelling of communicable diseases.

Olivia Hunter is a senior undergraduate student at Dartmouth College, USA. She is majoring in Biology, with a minor in Geography.

Bart Guetti is a freelance researcher specializing in GIS and spatial analysis.

Angeline Andrew is a Professor of Epidemiology at Geisel School of Medicine, Dartmouth College, USA. Her primary research area is in environmental health.

Elijah Stommel is a Professor of Neurology at Geisel School of Medicine, Dartmouth College, USA. He specializes in environmental impacts on neural diseases.

Walter Bradley is a Professor of Neurology at the University of Miami, USA. He specializes in environmental impacts on neural diseases.

Margaret Karagas is a Professor of Epidemiology at Geisel School of Medicine, Dartmouth College, USA. Her primary research area is in environmental health.

An illustration of the result from an inverse-distance weighting (IDW) interpolation: each vertical bar represents a sample, with the height indicating the sample value; the thin lines passing through the tops of the bars represent the interpolated continuous surface. The result is generated using the decay factor

An illustration of the result from kernel density estimation (KDE): each vertical bar represents a value point, with the height indicating the attribute value; the curvy thin line represents the estimation result, using the Epanechnikov function

A process of using kernel density estimation (KDE) to model the spread of pesticides from farms. The polygons are farmlands, with A1 and A2 belonging to farm A, and B belonging to farm B. To run KDE, we first convert the polygons into points by generating random points within each polygon, with the number of points determined by the area of the polygon. The total amount of pesticide in an application to a farm is first allocated to the polygons of the farm in proportion to their areas (not to the number of points), and then equally allocated to the points in each polygon. These value points are then used to generate the density surface of the pesticide. Note: the points in A1 and A2 have different pesticide values because of the precision loss in discretizing polygon area into a number of points.

An illustration of the result from

An illustration of the result from

summarizes the contrast among the spatial interpolation, spatial KDE, and spatial

| Method | Equations | Input | Output | Application Examples |
|---|---|---|---|---|
| Interpolation | | Spatial samples selected from a physical surface. | An inferred physical surface. | • Use measurements from randomly allocated monitoring stations to estimate the exposure to PM_{2.5}. |
| KDE | | Total quantity at a complete set of source locations. | The result of a spreading process. | • Use governmental records of farm applications to estimate the exposure to pesticides. |
| Snapshotting | | Temporal samples measured at a complete set of source locations. | A snapshot for a state during a spreading process. | • Use measurements from street intersections to estimate the exposure to PM_{2.5}. |