This is an Open Access article distributed under the terms of the Creative Commons Attribution License (

The assignment of a point-level geocode to subjects' residences is an important data assimilation component of many geographic public health studies. Often, these assignments are made by a method known as automated geocoding, which attempts to match each subject's address to an address-ranged street segment georeferenced within a streetline database and then interpolate the position of the address along that segment. Unfortunately, this process results in positional errors. Our study sought to model the probability distribution of positional errors associated with automated geocoding and E911 geocoding.

Positional errors were determined for 1423 rural addresses in Carroll County, Iowa as the vector difference between each 100%-matched automated geocode and its true location as determined by orthophoto and parcel information. Errors were also determined for 1449 60%-matched geocodes and 2354 E911 geocodes. Huge (> 15 km) outliers occurred among the 60%-matched geocoding errors; outliers also occurred for the other two types of geocoding errors but were much smaller. E911 geocoding was more accurate (median error length = 44 m) than 100%-matched automated geocoding (median error length = 168 m). The empirical distributions of positional errors associated with 100%-matched automated geocoding and E911 geocoding exhibited a distinctive Greek-cross shape and had many other interesting features that could not be fitted adequately by a single bivariate normal or t distribution. However, mixtures of t distributions with two or three components fit the errors very well.

Mixtures of bivariate t distributions with few components appear to be flexible enough to fit many positional error datasets associated with geocoding, yet parsimonious enough to be feasible for nascent applications of measurement-error methodology to spatial epidemiology.

It is becoming increasingly common in public health studies to use the spatial locations of study participants in statistical analyses, for example to test for geographic clustering of disease or to estimate relationships between environmental exposures and disease. Indeed, statistical methods for spatial epidemiology are developing rapidly, and the growing list of book-length treatments of the subject includes [

The spatial coordinates of a place of residence are usually not measured directly; rather, the residential address is given a location reference, known as a geocode. The geocode may be defined as the latitude and longitude coordinates or a point in some other coordinate system, or as a statistical tabulation area such as a U.S. Census tract, block group, or block. Here, unless noted otherwise, we use the point rather than areal definition. Several distinct methods for geocoding exist, including visiting the residence with global positioning system (GPS) receivers, identifying the residence on orthophoto maps based on aerial imagery, and matching the address to a digital street map. The latter can be done in batch mode for large numbers of addresses and when done this way is often called "automated geocoding." Recently, a new method of automated geocoding has been developed that matches an address to parcel descriptions of legal property boundaries developed by assessors, but this method has not yet been widely adopted. The U.S. Census Bureau is developing such a parcel-level geocode for all U.S. addresses, but the public does not and will not have access to these geocodes. Accordingly, automated geocoding here will refer to the widely used practice of using a geographic information system (GIS) to match an address to a street name and address range in a digitized street reference map and then estimate, via interpolation, where the address is located between the two points that define the limits of the address range.
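The interpolation step described above can be sketched as follows. This is a minimal illustration, not the algorithm of any particular GIS package: the `StreetSegment` class, its field names, and the example coordinates are hypothetical, and real software also handles odd/even street sides and offsets from the centerline.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StreetSegment:
    """Hypothetical address-ranged street segment (field names are illustrative)."""
    x_from: float   # coordinates of the segment end carrying addr_from
    y_from: float
    x_to: float     # coordinates of the segment end carrying addr_to
    y_to: float
    addr_from: int  # house number at the start of the address range
    addr_to: int    # house number at the end of the address range


def interpolate_geocode(house_number: int, seg: StreetSegment) -> Tuple[float, float]:
    """Place a house number along the segment in proportion to its position in the range."""
    if seg.addr_to == seg.addr_from:
        frac = 0.5  # degenerate range: fall back to the segment midpoint
    else:
        frac = (house_number - seg.addr_from) / (seg.addr_to - seg.addr_from)
    if not 0.0 <= frac <= 1.0:
        raise ValueError("house number lies outside the segment's address range")
    x = seg.x_from + frac * (seg.x_to - seg.x_from)
    y = seg.y_from + frac * (seg.y_to - seg.y_from)
    return x, y


# A 1600 m east-west segment with address range 100-200 (made-up values).
seg = StreetSegment(0.0, 0.0, 1600.0, 0.0, 100, 200)
point = interpolate_geocode(125, seg)
```

Because the estimate depends only on the address range, a wrong range or irregular house numbering shifts the geocode along the street.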

Automated geocoding is cheaper, more convenient, and hence much more common than non-automated methods, but considerably less accurate. Several investigations of the accuracy of automated geocoding have recently been published. Some of these have measured accuracy by the proportion of addresses for which the geocode belongs to a correct statistical tabulation area; for example, Yang et al. [

An alternative method of geocoding that may have promise for public health research is E911 geocoding. E911 geocodes are usually obtained under the auspices of local governments for the specific purpose of dispatching emergency vehicles to the correct location in response to a 9-1-1 telephone call requesting assistance. The particular methods used to obtain the geocodes vary, but they generally are more resource-intensive than mere automated geocoding due to the life-and-death issues at stake. For example, some counties have used parcel address-matching, while others have hired commercial firms that claim to take a GPS measurement at or near each residence. Every year, more counties in the U.S. develop E911 geocodes, so it is possible that in the not-too-distant future, many health researchers will be able to use these geocodes in lieu of performing automated geocoding. Investigations of the accuracy of E911 geocodes have not yet appeared in the scientific literature, though commercial firms offering E911 geocoding services tout them, unsurprisingly, as much more accurate than geocodes obtained via automated geocoding.

Whatever process is used to obtain geocodes of residences, the positional errors incurred by that process introduce location uncertainties that may adversely affect spatial analytic methods. Specific effects of positional errors on spatial statistical analyses include inflation of standard errors of parameter estimates and a reduction in power to detect such spatial features as clusters and trends [

The main purpose of this article is to formulate and fit useful models for the probability distribution of positional errors incurred by geocoding residential addresses. In particular, we will formulate models that are sufficiently flexible to allow for the representation of features observed in empirical distributions of positional errors derived from a dataset of rural Iowa addresses, yet sufficiently simple that the aforementioned measurement-error and multiple imputation methodologies could be successfully implemented using these models. Positional errors corresponding to both automated geocoding and E911 geocoding will be considered. Upon formulating a suitable model or class of models for the errors, we will demonstrate how to fit those models to the data. Although the specific features seen in the distributions of positional errors from this predominantly rural Iowa county will not occur in all datasets, nor even in all error datasets derived from rural addresses, we believe that the methods we use to formulate and fit the models are generalizable to a great many datasets of positional errors incurred by geocoding.

The address data upon which this investigation is based consist of all 2516 rural residential addresses in Carroll County, Iowa, USA, current as of 31 December 2005, which we obtained in conjunction with a comprehensive study of rural health in Iowa by the Iowa Department of Public Health and other researchers at the University of Iowa. A major objective of the study was to investigate the possible existence of associations between various health outcomes and exposure to environmental contaminants produced by concentrated animal feeding operations. Hence the focus on rural addresses, which were defined as all residential addresses that lie outside incorporated township boundaries.

An attempt was made to obtain a geocode of each rural address using an automated method, an E911 method, and an orthophoto method, as follows.

As it happened, only 26 more addresses geocoded when a 60%-match criterion was used than when a 100%-match criterion was used, and of those additional geocodes, eight were extreme outliers occurring in three clusters located 12–16 km from their actual locations. A closer look at these outliers revealed that the extremely large positional errors were due to errors in the TIGER street centerline file such as an incorrect zip code, an address range for a street segment that fails to contain the house number, or a missing street segment. As a consequence of the automated geocoding software's matching algorithm, these errors tended to result in geocodes corresponding to an address with the same house number but lying on a street segment with a different but similar "name," e.g. "120th St" rather than "210th St," or "20th St" rather than "260th St." Rare, gregarious outliers such as these present a severe challenge to any modeling enterprise, including the mixture modeling approach to be featured here. Consequently, for our purposes we set these outliers aside and considered only the geocodes of 100%-matched addresses.

For emergency services dispatch purposes,

Using visual identification, the third author enhanced the E911 geocode for each address to a location centered on the residence related to the address. This task was accomplished with the aid of 24 inch/pixel grayscale orthophotos of the study area we obtained from the Carroll County GIS Administrator and color infrared orthophotos (with the same resolution) obtained from [

Of the three geocoding methods, the orthophoto method is by far the most accurate; hence the geocodes produced by this method were taken as the "gold standard," or truth. For each of the other two methods, the positional error corresponding to a given address was determined as the vector difference between the address's geocode obtained by that method and the address's orthophoto-derived geocode. For various reasons – most frequently the inability to determine which of several buildings in the photograph was the residence – a completely reliable orthophoto-derived geocode could not be ascertained for 162 of the addresses, so our analysis of positional errors is based on the remaining 2354 addresses.
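As a small illustration of this definition, the sketch below computes error vectors and their lengths from paired coordinates. The arrays hold made-up projected coordinates in meters, not study data, and the function names are ours.

```python
import numpy as np


def positional_errors(geocodes, truth):
    """Error vectors: geocode minus gold-standard location, one row per address."""
    return np.asarray(geocodes, dtype=float) - np.asarray(truth, dtype=float)


def error_lengths(errors):
    """Euclidean length of each error vector."""
    return np.hypot(errors[:, 0], errors[:, 1])


# Illustrative coordinates only (easting, northing in meters).
geocodes = np.array([[103.0, 204.0], [500.0, 500.0], [10.0, -5.0]])
truth = np.array([[100.0, 200.0], [500.0, 380.0], [10.0, 7.0]])

errors = positional_errors(geocodes, truth)
lengths = error_lengths(errors)
median_length = float(np.median(lengths))
```

Keeping the errors as vectors rather than lengths preserves the directional structure that the mixture models are meant to capture.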

In seeking useful models for a distribution of positional errors, one might first consider a bivariate normal distribution or a uniform distribution on a "standard" two-dimensional region (e.g. a circle or square). Indeed, normal and uniform distributions have been used previously to study the effects of location errors on spatial analyses in general, and on spatial prediction (kriging) and cluster detection in particular [_{1},..., _{g }in some proportions _{1},..., _{g}, respectively, where _{i }≥ 0 (

where _{i}(_{i}; _{1},..., _{g}). Furthermore, we focus on mixtures of bivariate normal and t distributions, which are the most commonly used mixture models for bivariate observations and are well-suited for observations contaminated by outliers and exhibiting multi-axial clustering. The t mixtures are more robust than normal mixtures to contamination by outliers, hence they generally yield more parsimonious models than normal mixtures for data with outliers.
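To make the idea of a finite mixture concrete, the sketch below evaluates a mixture density as a proportion-weighted sum of bivariate normal component densities. The function names and the example parameters are illustrative assumptions, not values from the study.

```python
import numpy as np


def bvn_pdf(y, mean, cov):
    """Bivariate normal density evaluated at each row of y."""
    y = np.atleast_2d(np.asarray(y, dtype=float))
    diff = y - np.asarray(mean, dtype=float)
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * maha) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))


def mixture_pdf(y, weights, means, covs):
    """Mixture density: sum over components of proportion times component density."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixing proportions must be non-negative and sum to 1")
    return sum(w * bvn_pdf(y, m, c) for w, m, c in zip(weights, means, covs))


# Illustrative two-component example: a tight cluster near the origin plus a
# component elongated along the x-axis (parameters are made up).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [300.0, 0.0]]
covs = [[[2500.0, 0.0], [0.0, 2500.0]], [[250000.0, 0.0], [0.0, 6400.0]]]
density = mixture_pdf([[0.0, 0.0], [300.0, 0.0]], weights, means, covs)
```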

For each of the two sets of positional errors – corresponding to automated and E911 geocodes – we obtained likelihood-based estimates of the parameters of normal mixtures and t mixtures for several values of

where _{i }and _{i}, are the mean vector and covariance matrix, respectively, of the _{1},..., _{g}, and _{1},..., _{g}, we find that the likelihood function corresponding to a random sample _{1},..., _{n }from

In this subsection the number of groups,

The likelihood equation,

is equivalent to the equations

for

The

The normal mixture likelihood-based estimation method just described was carried out for the Carroll County positional error data using the FORTRAN program EMMIX written by D. Peel and G.J. McLachlan, which can be downloaded freely from [_{i}, and the sample mean vector and sample covariance matrix of the observations belonging to the _{i }and _{i}, respectively.
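For readers without access to EMMIX, the following is a minimal sketch of a generic EM iteration for a bivariate normal mixture. It is not the EMMIX implementation. The starting partition (equal-sized chunks ordered by the first coordinate) is our own simplifying assumption, and the simulated data are illustrative only.

```python
import numpy as np


def normal_pdf(y, mean, cov):
    """Bivariate normal density evaluated at each row of y."""
    diff = y - mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * maha) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))


def em_normal_mixture(y, g, n_iter=100):
    """Run n_iter EM iterations for a g-component bivariate normal mixture."""
    n = y.shape[0]
    # Assumed starting partition: g equal-sized chunks ordered by the first coordinate.
    resp = np.zeros((n, g))
    for i, chunk in enumerate(np.array_split(np.argsort(y[:, 0]), g)):
        resp[chunk, i] = 1.0
    for _ in range(n_iter):
        # M-step: weighted proportions, mean vectors, and covariance matrices.
        n_i = resp.sum(axis=0)
        weights = n_i / n
        means = (resp.T @ y) / n_i[:, None]
        covs = np.empty((g, 2, 2))
        for i in range(g):
            diff = y - means[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / n_i[i]
        # E-step: posterior probability that each observation belongs to each component.
        dens = np.column_stack(
            [weights[i] * normal_pdf(y, means[i], covs[i]) for i in range(g)]
        )
        resp = dens / dens.sum(axis=1, keepdims=True)
    loglik = float(np.log(dens.sum(axis=1)).sum())
    return weights, means, covs, loglik


# Illustrative data: two well-separated components (made-up parameters).
rng = np.random.default_rng(1)
first = rng.multivariate_normal([0.0, 0.0], [[900.0, 0.0], [0.0, 900.0]], size=300)
second = rng.multivariate_normal([400.0, 0.0], [[3600.0, 0.0], [0.0, 400.0]], size=200)
y = np.vstack([first, second])
weights, means, covs, loglik = em_normal_mixture(y, g=2)
```

In practice one would run several starting partitions and add a convergence check, since EM can stop at a local maximum of the likelihood.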

For the t mixture models, we obtained likelihood-based estimates of parameters using the ECM (expectation-conditional maximization) method described by McLachlan and Krishnan [

where Γ(·) is the gamma function, and _{i }and _{i} are the mean vector and covariance matrix, respectively, and _{i }is the degrees of freedom parameter, of the _{1},..., _{n }from a

with _{i}(·) defined in (7) and with _{1},..., _{g}, _{1},..., _{g}, _{1},..., _{g}, and _{1},..., _{g}. Details of the implementation of the ECM estimation algorithm to t mixture models are too lengthy to report here; however, they can be found in [
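As a point of reference, the standard multivariate t density with a location vector, a scale matrix, and a degrees-of-freedom parameter can be coded as below. This is a sketch of the textbook form, which we assume matches the component density used here. Note that the matrix is a scale matrix: the covariance of the component is the scale matrix multiplied by df / (df - 2), and it exists only when df exceeds 2.

```python
import math

import numpy as np


def mvt_pdf(y, mean, scale, df):
    """Multivariate t density at each row of y (location mean, scale matrix, df > 0)."""
    y = np.atleast_2d(np.asarray(y, dtype=float))
    p = y.shape[1]
    diff = y - np.asarray(mean, dtype=float)
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(scale), diff)
    # Normalizing constant, computed on the log scale for numerical stability.
    log_norm = (
        math.lgamma((df + p) / 2.0)
        - math.lgamma(df / 2.0)
        - 0.5 * p * math.log(math.pi * df)
        - 0.5 * math.log(np.linalg.det(scale))
    )
    return np.exp(log_norm - 0.5 * (df + p) * np.log1p(maha / df))


# Density at the location for an identity scale matrix; small df gives heavy tails.
peak = mvt_pdf([[0.0, 0.0]], [0.0, 0.0], np.eye(2), df=1.6)
```

As df grows, the density approaches the bivariate normal, which is why t mixtures can absorb outliers with fewer components than normal mixtures.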

In the previous subsection it was assumed that the number of components in the mixture distribution was known. While this assumption is appropriate for some applications of mixture models, for example when the subpopulations are males and females or a known number of age classes, it is generally not appropriate for modeling positional errors incurred by geocoding. Thus, the number of components in a mixture distribution for positional errors must be determined using the data at hand. Several methods for accomplishing this have been proposed, ranging from informal graphical techniques to more formal hypothesis testing procedures. Here, we choose the number of components using the

where L(
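The sketch below shows one common way to compute such a criterion. It assumes the usual Schwarz form, minus twice the maximized log-likelihood plus the number of free parameters times the log of the sample size, under which smaller values are preferred. The parameter count assumes unrestricted component covariance matrices and, for t mixtures, a separate degrees-of-freedom parameter per component; both are our assumptions rather than details confirmed by the text.

```python
import math


def n_free_parameters(g, family):
    """Free parameters in a g-component bivariate mixture (assumed parameterization)."""
    per_component = 2 + 3  # two means plus three distinct covariance entries
    if family == "t":
        per_component += 1  # one degrees-of-freedom parameter per component
    elif family != "normal":
        raise ValueError("family must be 'normal' or 't'")
    return (g - 1) + g * per_component  # g - 1 free mixing proportions


def bic(loglik, n_params, n_obs):
    """Schwarz criterion in the 'smaller is better' form."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Example with a made-up log-likelihood for a three-component t mixture.
example = bic(loglik=-22450.0, n_params=n_free_parameters(3, "t"), n_obs=1423)
```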

We provide the following example to illustrate the effectiveness of the mixture model estimation and model selection methodology. Two hundred observations were simulated from a bivariate normal distribution with means _{X }= _{Y }= 0 (for both variables), variances _{X }= _{Y }= 10, variances

Scatterplot of simulated data from two-component bivariate normal mixture model. The upper left panel displays 200 observations from the first component; the upper right panel displays 200 observations from the second component; the lower left panel is a superposition of the two upper panels; and the lower right panel displays a new simulation of 400 observations from the two-component normal mixture model fitted to the data from the original simulation.

First component:

Second component:

These estimates match the true parameter values very well. Finally, the fitted mixture model was used to generate a new set of 400 observations, which are also displayed in Figure

Of the 2354 rural addresses in Carroll County with orthophoto-derived geocodes, 1423 (60.5%) geocoded using the automated method with a 100%-match criterion. The positional errors (which are two-dimensional vectors) associated with these geocodes ranged in length from a minimum of 3 m to a maximum of 2896 m, with a median of 168 m, and are displayed as points in Figure

Scatterplot of positional errors (in meters) for the automated geocodes. The upper left panel displays the complete data; the upper right panel displays errors for addresses on streets aligned E-W; the lower left panel displays errors for addresses on streets aligned N-S; and the lower right panel is a superposition of the upper right panel and a 90-degree counterclockwise rotation of the lower left panel.
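The superposition in the lower right panel amounts to rotating the N-S street errors by 90 degrees counterclockwise and pooling them with the E-W street errors. A minimal sketch, with made-up error vectors and our own function name:

```python
import numpy as np

# Rotation by 90 degrees counterclockwise maps (x, y) to (-y, x).
ROT90_CCW = np.array([[0.0, -1.0], [1.0, 0.0]])


def align_errors(errors_ew, errors_ns):
    """Pool E-W street errors with N-S street errors rotated 90 degrees counterclockwise."""
    rotated_ns = np.asarray(errors_ns, dtype=float) @ ROT90_CCW.T
    return np.vstack([np.asarray(errors_ew, dtype=float), rotated_ns])


# Illustrative vectors in meters: an eastward error on an E-W street and a
# southward error on an N-S street end up pointing the same way.
errors_ew = np.array([[300.0, 10.0]])
errors_ns = np.array([[-5.0, -350.0]])
aligned = align_errors(errors_ew, errors_ns)
```

After alignment, the first coordinate measures error along the street and the second measures error across it, regardless of the street's orientation.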

Manual checking of the fifty largest errors revealed that many were attributable to street segments in the TIGER/Line file that had correct street names but incorrect address ranges. Others appeared to be attributable to interpolation errors or possibly house address numbering "errors" (i.e. deviations from the distance-from-intersection rule or some other rule that was used when the houses were originally numbered). These database and procedural errors, in combination with the high degree of rectilinearity of the rural road network in Carroll County, produce the distinctive Greek-cross shape of the empirical distribution of positional errors. Outliers from this overall shape appear to be due to either very large offsets (e.g., one house was nearly 800 m from its corresponding street centerline), incorrect TIGER/Line file geometry, or both.

We do not have a ready explanation for the bias with respect to the origin exhibited by the errors. However, the fact that the mean errors are shifted to the east along E-W streets and south along N-S streets, in tandem with the fact that these directions of shift coincide with the directions in which rural house numbers are ascending, suggests that the explanation has something to do with a systematic interpolation or house numbering procedural error. As a follow-up, we computed the mean error for each individual street and found that these means were consistently, in fact invariably, to the east and south. Thus the bias is pervasive, not merely limited to a few streets.

Owing to the Greek-cross shape of the empirical distribution of the entire set of positional errors, no single bivariate normal or t distribution will fit them well, nor for that matter will

Bayesian Information Criteria (

Error dataset | Distribution | Number of components | BIC |
(a) | Normal | 1 | 48103 |
(a) | Normal | 2 | 45851 |
(a) | Normal | 3 | 45236 |
(a) | Normal | 4 | 45124 |
(a) | t | 1 | 46083 |
(a) | t | 2 | 45358 |
(a) | t | 3 | 45056 |
(a) | t | 4 | 45042 |
(b) | Normal | 1 | 46422 |
(b) | Normal | 2 | 44809 |
(b) | Normal | 3 | 44597 |
(b) | Normal | 4 | 44557 |
(b) | t | 1 | 45659 |
(b) | t | 2 | 44538 |
(b) | t | 3 | 44516 |
(b) | t | 4 | 44459 |
(c) | Normal | 1 | 67174 |
(c) | Normal | 2 | 63174 |
(c) | Normal | 3 | 62710 |
(c) | Normal | 4 | 62446 |
(c) | t | 1 | 62841 |
(c) | t | 2 | 62345 |
(c) | t | 3 | 62219 |
(c) | t | 4 | 62230 |
(d) | Normal | 1 | 64227 |
(d) | Normal | 2 | 61360 |
(d) | Normal | 3 | 61101 |
(d) | Normal | 4 | 61059 |
(d) | t | 1 | 61092 |
(d) | t | 2 | 60980 |
(d) | t | 3 | 60982 |
(d) | t | 4 | 60994 |

Models with several different numbers of components were fitted to the following four error datasets: (a) automated geocoding positional errors; (b) automated geocoding positional errors aligned with axial direction of corresponding street segment; (c) E911 positional errors; (d) E911 positional errors aligned with axial direction of corresponding street segment.
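The tabulated values can be ranked programmatically, as in the sketch below, which treats smaller values as better. The dictionary simply transcribes the table. The ranking is only a reading aid: for datasets (a) and (b) the smallest value belongs to the four-component t mixture, whereas the models reported in the parameter table have fewer components, which may reflect a preference for parsimony when the gain from an extra component is small.

```python
# Criterion values transcribed from the table, keyed by dataset and (family, components).
BIC = {
    "a": {("normal", 1): 48103, ("normal", 2): 45851, ("normal", 3): 45236,
          ("normal", 4): 45124, ("t", 1): 46083, ("t", 2): 45358,
          ("t", 3): 45056, ("t", 4): 45042},
    "b": {("normal", 1): 46422, ("normal", 2): 44809, ("normal", 3): 44597,
          ("normal", 4): 44557, ("t", 1): 45659, ("t", 2): 44538,
          ("t", 3): 44516, ("t", 4): 44459},
    "c": {("normal", 1): 67174, ("normal", 2): 63174, ("normal", 3): 62710,
          ("normal", 4): 62446, ("t", 1): 62841, ("t", 2): 62345,
          ("t", 3): 62219, ("t", 4): 62230},
    "d": {("normal", 1): 64227, ("normal", 2): 61360, ("normal", 3): 61101,
          ("normal", 4): 61059, ("t", 1): 61092, ("t", 2): 60980,
          ("t", 3): 60982, ("t", 4): 60994},
}


def rank_models(table):
    """Sort models by criterion value, reporting each model's gap to the smallest value."""
    best = min(table.values())
    gaps = [(model, value - best) for model, value in table.items()]
    return sorted(gaps, key=lambda item: item[1])


best_models = {name: rank_models(table)[0][0] for name, table in BIC.items()}
runner_up_a = rank_models(BIC["a"])[1]
```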

Likelihood-based estimates of the mean vector and covariance matrix for the three-component t model are given in Table

Likelihood-based parameter estimates for the best-fitting models.

Error dataset | Component | Proportion | $\mu_X$ | $\mu_Y$ | $\sigma_X$ | $\sigma_Y$ | Correlation | Degrees of freedom |
(a) | 1 | 0.571 | -12.1 | -10.7 | 61.6 | 54.1 | -0.05 | 1.6 |
(a) | 2 | 0.253 | -4.7 | -350.0 | 75.9 | 550.0 | 0.18 | 6.5 |
(a) | 3 | 0.176 | 352.8 | -12.6 | 540.3 | 84.9 | -0.03 | 16.7 |
(b) | 1 | 0.560 | -0.8 | -14.2 | 39.4 | 75.9 | 0.06 | 1.8 |
(b) | 2 | 0.440 | 372.1 | -6.7 | 523.6 | 90.3 | -0.10 | 5.9 |
(c) | 1 | 0.519 | 4.9 | -5.4 | 62.3 | 60.8 | -0.10 | 1.8 |
(c) | 2 | 0.292 | 13.6 | -35.0 | 289.1 | 54.9 | -0.14 | 2.4 |
(c) | 3 | 0.189 | 14.9 | -10.2 | 62.1 | 354.4 | 0.14 | 2.4 |
(d) | 1 | 0.700 | 5.9 | -4.3 | 47.0 | 100.7 | 0.06 | 1.8 |
(d) | 2 | 0.300 | 29.3 | -6.2 | 62.1 | 419.5 | 0.16 | 3.0 |

Models and the datasets to which they were fitted are: (a) the three-component t mixture model for the automated geocoding positional errors; (b) the two-component t mixture model for the automated geocoding positional errors aligned with axial direction of corresponding street segment; (c) the three-component t mixture model for the E911 positional errors; (d) the two-component t mixture model for the E911 positional errors aligned with axial direction of corresponding street segment. Means are denoted by $\mu_X$ and $\mu_Y$, standard deviations by $\sigma_X$ and $\sigma_Y$, correlation coefficient by

Simulated data from the fitted three-component t mixture distribution for the automated geocoding errors. The upper left panel, upper right panel, and lower left panel correspond to components in order of decreasing mixing proportion $\pi_i$; the lower right panel is their superposition.
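A simulation of this kind can be sketched as follows, using the estimates reported for dataset (a). The sketch rests on two assumptions of ours: that the tabulated standard deviations and correlation define each component's scale matrix, and that the final column of the parameter table holds the degrees of freedom. Each t draw is generated as a normal vector divided by the square root of an independent chi-square variate over its degrees of freedom.

```python
import numpy as np

# Dataset (a) estimates: proportion, mean_x, mean_y, sd_x, sd_y, correlation, df.
COMPONENTS_A = [
    (0.571, -12.1, -10.7, 61.6, 54.1, -0.05, 1.6),
    (0.253, -4.7, -350.0, 75.9, 550.0, 0.18, 6.5),
    (0.176, 352.8, -12.6, 540.3, 84.9, -0.03, 16.7),
]


def scale_matrix(sd_x, sd_y, corr):
    """2 x 2 scale matrix built from two standard deviations and a correlation."""
    off = corr * sd_x * sd_y
    return np.array([[sd_x ** 2, off], [off, sd_y ** 2]])


def sample_t_mixture(components, n, rng):
    """Draw n error vectors from a bivariate t mixture; also return component labels."""
    props = np.array([c[0] for c in components])
    labels = rng.choice(len(components), size=n, p=props / props.sum())
    out = np.empty((n, 2))
    for i, (_, mx, my, sx, sy, corr, df) in enumerate(components):
        idx = np.flatnonzero(labels == i)
        z = rng.multivariate_normal([0.0, 0.0], scale_matrix(sx, sy, corr), size=idx.size)
        u = rng.chisquare(df, size=idx.size) / df
        out[idx] = np.array([mx, my]) + z / np.sqrt(u)[:, None]
    return out, labels


rng = np.random.default_rng(2005)
samples, labels = sample_t_mixture(COMPONENTS_A, n=1423, rng=rng)
```

Plotting `samples` colored by `labels` gives a rough visual check of whether the fitted components reproduce the cross-shaped pattern.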

The lower right panel of Figure

The positional errors corresponding to the 2354 E911 geocodes (Figure

Scatterplot of the positional errors (in meters) for the E911 geocodes. The upper left panel displays the complete data; the upper right panel displays errors for addresses on streets aligned E-W; the lower left panel displays errors for addresses on streets aligned N-S; and the lower right panel is a superposition of the upper right panel and a 90-degree counterclockwise rotation of the lower left panel.

The orthogonal alignment of E911 errors results from offset errors of substantial magnitude, which in turn are due to the definition of the E911 geocode in rural areas as the coordinates of the intersection of the public road and the private road leading to the residence, coupled with the approximate perpendicularity (in most cases) of the public and private roads. The outliers, for the most part, correspond to those cases for which the offset is relatively large and the private road meanders in such a way that a hypothetical line segment connecting the residence to the public road-private road intersection is far from being perpendicular to the public road.

Normal and t mixture distributions with various numbers of components were fitted to the E911 errors. Values of _{i}, and Figure

Simulated data from the fitted three-component t mixture distribution for the E911 geocoding errors. The upper left panel, upper right panel, and lower left panel correspond to components in order of decreasing mixing proportion $\pi_i$; the lower right panel is their superposition.

The lower right panel of Figure

The major question motivating this investigation was whether one could find useful models for the probability distribution of positional errors associated with geocoding, i.e. models that are sufficiently rich to adequately fit various geocoding error datasets yet sufficiently parsimonious to be practical for use as measurement-error models for statistical analysis. The answer to this question, based on our findings, is solidly (though not unequivocally) in the affirmative; and the class of models that seems best suited for the purpose is the class of mixture models of bivariate t distributions. These models can adequately fit such features as clustering along several axial directions, systematic bias in any direction(s), and outliers, all of which occurred in our data; simpler models such as uniform and normal distributions, which have been used previously for positional errors in spatial data, cannot. Moreover, t mixture models are feasible for use with emerging applications of measurement-error methodology to epidemiologic research [

The one situation we encountered in which mixture models of t distributions proved to be less than fully successful occurred with automated geocoding errors for which an address-matching threshold of less than 100% was used. In this situation, a few small clusters of extremely large errors occurred. Such errors are difficult to model parsimoniously and, regardless of how they are modeled, will weaken the conclusions made from subsequent statistical inferences using measurement-error methodology. Consequently, we recommend using only 100%-matched addresses for spatial epidemiologic analyses.

Our investigation indicated that t mixture models were equally useful for 100%-matched automated geocoding errors and E911 geocoding errors, despite some differences in their distinctive features. In particular, t mixtures were able to accommodate the difference in the major axis of error alignment relative to the alignment of the corresponding street (parallel for automated geocoding, perpendicular for E911 geocoding). The error distributions associated with other geocoding methods may have their own distinctive features (see [

Further investigation is currently underway to determine if t mixture models are as useful for positional errors corresponding to non-rural addresses as they appear to be for rural address positional errors and, if so, how the components might differ from those for rural addresses. Results from previous studies of positional errors for datasets combining both rural and non-rural addresses [

How might the methods developed here be adapted to the common situation in which it is not possible to obtain a "gold standard" geocode for each address that has been geocoded via automated geocoding? In some cases it may be feasible to obtain the more accurate geocode for a randomly selected portion of the addresses, from which the probability distribution of positional errors associated with automated geocoding may be estimated. This estimated distribution may then, as a practical matter, be presumed to apply to the entire set of addresses. In those cases where no sample of positional errors can be obtained, it may still be possible to estimate parameters of a probability distribution of positional errors, provided that a parsimonious model for the true locations of addresses is known (up to its unknown parameters). An illustration of this can be found in [

In focusing our attention on geocoding errors, we have ignored the fact that for many studies, automated geocoding is incomplete; that is, not all addresses can be assigned point-level spatial coordinates by the software. In fact, it is common in practice for 20% or even as many as 40% of subjects' addresses to fail to geocode using standard software and street files. For example, Gregorio et al. [

DLZ conceived of this study and drafted the majority of the manuscript. DLZ also directed, and XF performed, the statistical analysis. SM performed the automated geocoding and oversaw the orthophoto geocoding of the Carroll County data and contributed to the writing of the Methods section. GR contributed to the writing of several sections.

The work of the authors was supported by Centers for Disease Control and Prevention (CDC) Grant Number 3 R01 EH000056-01S1 with the Iowa Department of Public Health (IDPH) and Contract Number 5886CAR02 between the IDPH and the University of Iowa. The views expressed are solely those of the authors and do not represent the views of CDC or IDPH. We thank Carl Wilburn, GIS Coordinator for Carroll County, Iowa for providing address data and E911 geocodes for Carroll County.