This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
This research develops methods for determining the effect of geocoding quality on relationships between environmental exposures and health. The likelihood of detecting an existing relationship (statistical power) between measures of environmental exposures and health depends not only on the strength of the relationship but also on the positional accuracy and completeness of the geocodes from which the measures of environmental exposure are made. This paper summarizes the results of simulation studies conducted to examine the impact of inaccuracies in geocoded addresses generated by three types of geocoding processes: a) addresses located on orthophoto maps; b) addresses matched to TIGER files (U.S. Census or derivative street files); and c) addresses from E-911 geocodes (developed by local authorities for emergency dispatch purposes).
The simulated odds of disease using exposures modelled from the highest quality geocodes could be sufficiently recovered using other, more commonly used, geocoding processes such as TIGER and E-911; however, the strength of the odds relationship between disease and exposures modelled at geocodes generally declined with decreasing geocoding accuracy.
Although these specific results cannot be generalized to new situations, the methods used to determine the sensitivity of results can be used in new situations. Estimated measures of positional accuracy must be used in the interpretation of results of analyses that investigate relationships between health outcomes and exposures measured at residential locations. Analyses similar to those employed in this paper can be used to validate interpretation of results from empirical analyses that use geocoded locations with estimated measures of positional accuracy.
Geocodes are geographic references for computer records that lack them [
The motivating question for the research in this paper is, how do errors in geocodes affect estimates of the relationship between environmental exposures and health outcomes? Statistical power in a model measuring the relationship between exposures and health is computed for different geocoding processes. The results are intended to help researchers decide whether a geocoding method under consideration in an environmental health study is adequate for risk assessment. A second motivating question asks whether it is possible to know the level of geocoding accuracy that is needed to establish the health risk of environmental contaminants in an area. We assume that the contaminant locations can be measured precisely and that the locations of persons exposed to the contaminants are subject to uncertainty. Our approach is similar to that taken by Rull and Ritz [
We use an experimental method to determine the effect of geocoding inaccuracy on the ability to recover relationships between environmental exposures and health. In our experiments, hypothetical risk models are used to simulate health outcomes for a given spatial pattern of environmental contaminants and a given spatial pattern of exposed individuals. For the given spatial pattern of contaminants, we generate health data for hypothetical individuals living at known address locations in Carroll County, Iowa. The address locations used to calculate the environmental contaminant values, and subsequently to generate the expected health outcomes, are highly accurate geographic locations obtained by identifying the residential structure corresponding to each address on a properly registered orthophoto map. This geocoding process is abbreviated as Go. We then ask how this known relationship compares with relationships between environmental exposures and health outcomes estimated from two other methods for geocoding the addresses. One method is the geocoding process used by emergency responders (GE, E-911 geocoding); the other is the well-known automated address-matching approach using TIGER (Topologically Integrated Geographic Encoding and Referencing) line files from the U.S. Census (GT, with and without offset). In the experiments described in more detail below, measures of exposure are degraded by geocoding errors in the locations of individuals. The effect of these errors is assessed by examining the accuracy of the resulting odds ratio estimates. It is not our objective in this study to determine which geocoding process is optimal; such an analysis could be a natural extension of this work. Rather, we develop methods to study the effect of geocoding inaccuracy on the relationships between environmental exposure and health, and we demonstrate them using the three exemplar geocoding processes.
In the next section, we discuss the theoretical framework underlying our approach.
While it is possible to apply the method outlined in this research to aggregated health/environmental data (e.g. aggregated at the level of the Census tract), we confine this discussion to the use of individual level address data. We assume that the dataset consists of N unique addresses, with one individual resident at each address. Like Armstrong et al [
In addition, the expected health effect E(Wi) can be modelled as a function of Zi and covariates Li as E(Wi) = g(Zi, Li), where g( ) is often a linear or logistic regression model. A simpler approach is to model health outcomes as a function of the environmental contaminant only; i.e.
E(Wi) = g(Zi). It can thus be seen that the model relating W to the contaminant is a function of the geocoding process as well:
Note that given a function 'g', a GIS contaminant model 'm', and known Ai's, the left hand side of equation (2) can be simulated from the right hand side. Wi can often be represented by a binary variable. For example, in population-based studies cases could be coded as 1s and controls as 0s. Alternatively, if the study design is a proportionate incidence or proportionate mortality study, then a certain ICD-9 (International Classification of Diseases, Ninth Revision) code can be coded as 1 and all other ICD-9 codes as 0. In such instances, we can express Wi as a Bernoulli random variable where:
and
The probability function of Wi is:
The relationship between
The contaminant values Z are continuous in nature, and the associated model parameter is interpreted as follows: every unit increase in exposure to the contaminant Z causes an increase of
From equations 1 and 4, we can write:
From 5, we see that for a given address, relationship
With reference to equation (4), under the null hypothesis, there is no relationship between exposure Z and health outcomes. The odds of disease from having been exposed is therefore 1.
Under the non-informative alternative hypothesis, the odds of disease is different from one. While exposure to contamination usually increases the odds of disease, so that we can expect a value greater than one, we allow for the possibility of the odds being less than one; i.e. an alternative hypothesis of
For a given alternative
a) Disease data W are simulated according to a known relationship g(.) between Z and W.
b) All model parameters other than G remain constant, and an effort is made to estimate the relationship between Z and W. The extent to which the estimated relationship varies from the true relationship, as G varies, is a measure of the decline in the quality of G. In this study, we apply this procedure to a real situation occurring in Carroll County, Iowa. The three types of geocoding processes examined are typical of those that are used for counties in the Midwestern U.S.
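As a compact restatement of the testing framework above, the hypotheses and the definition of power can be sketched as follows. This assumes the standard single-predictor logistic form; the symbols \beta_0, \beta_1 and \mathrm{OR}_A are notation introduced here for illustration and are not necessarily the paper's own.

```latex
\begin{aligned}
\operatorname{logit}\Pr(W_i = 1) &= \beta_0 + \beta_1 Z_i, \qquad Z_i = m(G(A_i)),\\
H_0 &: e^{\beta_1} = 1 \quad\text{versus}\quad H_A : e^{\beta_1} \neq 1,\\
\mathrm{Power}(G) &= \Pr\bigl(\text{reject } H_0 \,\bigm|\, e^{\beta_1} = \mathrm{OR}_A,\ Z \text{ computed from } G\bigr).
\end{aligned}
```

Under this reading, the outcomes W are generated once from the most accurate geocodes, and only the geocoding process G used to compute the predictor Z changes between the fitted models.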
The data we wish to develop consist of residential sites and associated contaminant values. Three geocoding processes were used to develop these datasets:
These are geocodes in which addresses are matched to Census street centerline files. Centerline files are produced by the U.S. Census and were available to us from the E.S.R.I's (Environmental Systems Research Institute) website [
E-911 geocodes are a promising means of accurately geocoding rural addresses [
Using visual identification, the E-911 rural addresses were 'enhanced' to a location centered on the residence related to each address. This task was accomplished with the aid of orthophoto maps of the study area at resolutions of 6 inches (15.2 cm) per pixel and 2 feet (61 cm) per pixel, current as of 2002. Figure
Locations of the geocoded rural addresses in Carroll County, Iowa.
Parcel geocoding was not considered a reliable geocoding method in these analyses. The median parcel size for the properties of both farm and non-farm residences in rural Carroll County is 1,618,703 square feet or 179,856 square yards (150,382 square meters), so that a geocode placed at the centre of a square parcel of this size would have a median error of approximately 671 feet (204 m). This error can be reduced with the help of ancillary knowledge of the location of the residence. Since the likely source of this knowledge would be an orthophoto image, anyone possessing this source would be better advised to extract the location of the residence as we have done in this work.
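The unit conversions for the median parcel size can be checked with a few lines of arithmetic. The sketch below is illustrative only: the side length and half-diagonal assume an idealized square parcel, and the half-diagonal is simply the largest distance a residence could lie from the parcel centre under that assumption, not the median error quoted in the text.

```python
import math

SQFT_PER_SQYD = 9.0           # 3 ft x 3 ft
SQM_PER_SQFT = 0.3048 ** 2    # the international foot is exactly 0.3048 m

median_parcel_sqft = 1_618_703  # median rural parcel size reported in the text

parcel_sqyd = median_parcel_sqft / SQFT_PER_SQYD
parcel_sqm = median_parcel_sqft * SQM_PER_SQFT

# Geometry of an idealized square parcel with the same area (an assumption).
side_ft = math.sqrt(median_parcel_sqft)
half_diagonal_ft = side_ft / math.sqrt(2)  # upper bound on centre-to-residence distance

print(f"{parcel_sqyd:,.0f} square yards, {parcel_sqm:,.0f} square meters")
print(f"side {side_ft:,.0f} ft, half-diagonal {half_diagonal_ft:,.0f} ft")
```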
Figure
Illustration of three types of geocoding: orthophoto, E-911 and TIGER with offset, for the address 10392 260th Street.
We started with a comprehensive dataset of 2,516 addresses representing all rural addresses in Carroll County. All addresses that are located outside the legal (incorporated) boundaries of towns are considered rural. For each address an E-911 geocode is available. The E-911 geocodes therefore have 100% completeness. Since the orthophoto geocodes are enhanced from the E-911 geocodes, all addresses have an orthophoto geocode. Of the 2,516 addresses, 14 were found to be duplicates and eliminated. A further 69 addresses were found to have been erroneously coded as rural and were removed. The remaining 2,443 addresses were geocoded to TIGER street centerline files. A minimum match score of 100% was used and no manual interactive matching was performed, because the purpose of this research is to show the effects of typical differences in locations between "perfectly geocoded" residences according to currently accepted geocoding processes (automated TIGER, E-911) and ground truth as exemplified by the orthophoto-determined locations. Of the 2,443 addresses, 1,581 were geocoded with a 100% match score to the TIGER street centerline files, a match rate of 64.7%. Our results represent a conservative view of the difference between TIGER or E-911 geocoded locations and ground-truth locations. Clearly, addresses that could not be geocoded accurately from the TIGER file would represent a systematically larger error than those studied here, and bias would be introduced by any attempt to interactively geocode the unmatched addresses.
This research thus utilizes the 'incomplete' [
In this research we utilize CAFOs (Concentrated Animal Feeding Operations) as the disease-causing contaminant source. CAFOs have been suspected as possible sources of disease-causing effluents in rural areas of the U.S. [
The locations of 55 CAFOs in Carroll County, for which permits had been issued by the state were obtained as a GIS layer file. A plume dispersal model based on the AERMOD (
The model was realized with a combination of MATLAB [
For the purposes of visualization, contaminant values were also computed for a 50 meter fine grid and the values were contoured in ArcGIS [
Estimated values of CAFO emissions in a region of South-East Carroll County, Iowa.
The simulation methodology consists of the following 8 steps:
1) Assume that one individual resides at each address. Simulate probabilities of disease for N = 1,581 individuals –
We take
2) Randomly sample M individuals out of N. (See step 7 for values of M)
3) For a given sample generate disease outcomes according to the distribution Wi ~ binomial (
4) Our objective here is to recreate the relationship that was used in simulating the health data in step 1. In trying to do this, we consider (i) that we only have a sample of M addresses/people from our dataset of N people/addresses and (ii) that the contaminant values given to us have been calculated using all three forms of geocoding which are m(G (Ao)) = Zo (Orthophoto geocoding), m(G (AT)) = ZT (TIGER geocoding) and m(G (AE)) = ZE (E-911 geocoding). For any given person/address, different logistic regression estimates of
Thus, for a given person/address, while the outcome is the same for all the models, the predictor in a) is m(G (Ao)) = Zo ; in b) it is m(G (AE)) = ZE ; and in c) it is m(G (AT)) = ZT.
5) Repeat steps 2 – 4 10,000 times.
Estimates of
6) Recovered odds (or the recovered relationship) are calculated as:
Power = (Number of significant
Finally, Confidence Intervals are calculated as:
7) Steps 2 to 6 are repeated with different values of M, to study how power varies with sample size, for different geocoding processes. The values of M chosen were 316, 474, 632, 790, 948, 1106, 1264, 1422 and 1581 which correspond to 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% of N respectively.
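The steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' R program: the baseline disease probability `baseline_p`, the two-sided Wald test at the 0.05 level, the use of the mean exponentiated estimate as the "recovered odds", and the synthetic exposures (a random surface plus noise standing in for geocoding error) are all hypothetical choices made here.

```python
import math
import random


def fit_logistic(z, w, max_iter=25, tol=1e-8):
    """Fit logit P(w=1) = b0 + b1*z by Newton-Raphson; return (b1, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, wi in zip(z, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * zi)))
            v = p * (1.0 - p)
            g0 += wi - p           # score for the intercept
            g1 += (wi - p) * zi    # score for the slope
            h00 += v               # Fisher information entries
            h01 += v * zi
            h11 += v * zi * zi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard error from the inverse information at the last evaluated iterate.
    return b1, math.sqrt(h00 / det)


def simulate(exposures, true_key, odds_ratio, baseline_p, m, reps, seed=1):
    """exposures maps a geocoding-process name to contaminant values per address.

    Outcomes are generated from exposures[true_key]; every process is then
    used as the predictor for the same simulated outcomes.
    """
    rng = random.Random(seed)
    b0 = math.log(baseline_p / (1.0 - baseline_p))
    b1 = math.log(odds_ratio)
    z_true = exposures[true_key]
    n = len(z_true)
    prob = [1.0 / (1.0 + math.exp(-(b0 + b1 * z))) for z in z_true]  # step 1
    or_sum = {k: 0.0 for k in exposures}
    n_sig = {k: 0 for k in exposures}
    for _ in range(reps):                                            # step 5
        idx = rng.sample(range(n), m)                                # step 2
        w = [1 if rng.random() < prob[i] else 0 for i in idx]        # step 3
        for name, z in exposures.items():                            # step 4
            est, se = fit_logistic([z[i] for i in idx], w)
            or_sum[name] += math.exp(est)
            n_sig[name] += abs(est / se) > 1.96   # assumed Wald test, alpha = 0.05
    return {k: {"mean_or": or_sum[k] / reps, "power": n_sig[k] / reps}
            for k in exposures}                                      # step 6


# Hypothetical exposures: a synthetic surface plus noise as a proxy for geocoding error.
demo_rng = random.Random(42)
z_ortho = [demo_rng.gauss(0.0, 1.0) for _ in range(400)]
z_noisy = [z + demo_rng.gauss(0.0, 5.0) for z in z_ortho]
result = simulate({"ortho": z_ortho, "noisy": z_noisy}, "ortho",
                  odds_ratio=2.0, baseline_p=0.3, m=300, reps=100)

# Step 7: sample sizes as 20%, 30%, ..., 100% of N, rounded down.
N = 1581
sample_sizes = [N * pct // 100 for pct in range(20, 101, 10)]
print(result)
print(sample_sizes)
```

Because every model in a replicate is fitted to the same simulated outcomes, differences in power between processes reflect only the differences in the exposure values assigned to each address.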
Sensitivity analysis was carried out to test the behaviour of the simulation when subjected to different values of input parameters. Specifically the
We define a geocoding error as the difference in distance units between the Orthophoto geocode and the geocode (E-911, TIGER) for a given address. Analyses of the TIGER (GT) geocoding errors and the E-911 geocode errors showed a median difference of 693 feet (211.23 m) for TIGER geocodes and 151 feet (46 m) for E-911 geocodes. Table
Summary statistics for distance errors.
| 3702.83 | 2137.26 | 1631.09 | 692.87 | 20.82 | 46600.53 | |
| 3634.15 | 2058.53 | 1556.22 | 654.15 | 3.97 | 46400.56 | |
| 3641.91 | 2040.28 | 1582.01 | 615.19 | 20.70 | 46401.50 | |
| 3658.40 | 2055.12 | 1653.10 | 670.92 | 38.85 | 46403.30 | |
| 3687.35 | 2095.46 | 1751.19 | 814.97 | 27.68 | 46405.97 | |
| 698.36 | 318.25 | 281.07 | 150.92 | 6.32 | 3196.67 | |
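The distance errors summarized above follow from the definition of a geocoding error as the distance between the orthophoto geocode and the alternative geocode for the same address. A minimal sketch, assuming both sets of points are in the same projected coordinate system measured in feet; the coordinates below are hypothetical and are not taken from the study data.

```python
import math
import statistics

FT_TO_M = 0.3048  # exact conversion for the international foot


def geocode_errors(reference, candidate):
    """Euclidean distance between paired (x, y) points in the same projected units."""
    return [math.hypot(cx - rx, cy - ry)
            for (rx, ry), (cx, cy) in zip(reference, candidate)]


# Hypothetical coordinates in feet, one pair per address.
ortho = [(0.0, 0.0), (1000.0, 500.0), (2500.0, 2500.0)]
e911 = [(30.0, 40.0), (1000.0, 650.0), (2560.0, 2580.0)]

errors_ft = geocode_errors(ortho, e911)
median_ft = statistics.median(errors_ft)
print(f"median error: {median_ft:.1f} ft ({median_ft * FT_TO_M:.1f} m)")
```

The same conversion factor reproduces the metric values quoted in the text for the reported medians of 693 feet and 151 feet.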
Correlations between contaminant values for each of 1,581 rural addresses for seven different geocoding methods.
| 1.00 | ||||||||
| 0.90 | 1.00 | |||||||
| 0.65 | 0.75 | 1.00 | ||||||
| 0.68 | 0.79 | 0.96 | 1.00 | |||||
| 0.71 | 0.79 | 0.94 | 0.97 | 1.00 | ||||
| 0.70 | 0.80 | 0.93 | 0.94 | 0.97 | 1.00 | |||
| 0.71 | 0.81 | 0.89 | 0.92 | 0.93 | 0.96 | 1.00 | ||
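A lower-triangular correlation table like the one above can be built from the contaminant values assigned to the same addresses by each geocoding method. The sketch below assumes the Pearson coefficient, which the text does not state explicitly; the method names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_table(columns):
    """Lower triangle (including the diagonal) of the correlation matrix."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b])
            for i, a in enumerate(names) for b in names[: i + 1]}


# Hypothetical contaminant values for five addresses under three geocoding methods.
values = {
    "ortho": [0.2, 1.5, 3.1, 0.9, 2.4],
    "e911": [0.3, 1.4, 2.8, 1.1, 2.5],
    "tiger": [1.0, 1.2, 2.0, 0.4, 3.0],
}
table = correlation_table(values)
for (row, col), r in table.items():
    print(f"{row:>6} vs {col:<6} {r:.2f}")
```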
Relationship between error in contaminant values at TIGER geocodes with true contaminant value.
Relationship between error in contaminant values at E-911 geocodes with true contaminant value.
One exploratory method of comparing the effect of errors in contaminant values from geocoding errors is the method of calculating the attenuation of odds ratios [
Variation in odds ratios in simulated disease from exposure to contaminant calculated to Orthophoto geocode and exposure to contaminant calculated to TIGER geocode, with error in contaminant calculation at a TIGER geocode.
Variation in odds ratios in simulated disease from exposure to contaminant calculated to Orthophoto geocode and exposure to contaminant calculated to E-911 geocode, with error in contaminant calculation at an E-911 geocode.
We tested the robustness of the simulation by changing the value of the simulated odds. The results of this sensitivity analysis are summarized in Table
Simulated mean estimates across fixed values of the true odds and the geocoding method.
| 0.98 | 1.15 | 1.21 | 1.51 | 2.02 | ||
| 0.99 | 1.16 | 1.21 | 1.44 | 1.76 | ||
| 0.99 | 1.15 | 1.21 | 1.45 | 1.74 | ||
| 0.99 | 1.15 | 1.21 | 1.47 | 1.80 | ||
| 0.99 | 1.19 | 1.25 | 1.56 | 1.91 | ||
| 0.99 | 1.17 | 1.24 | 1.50 | 1.82 | ||
| 1.00 | 1.19 | 1.26 | 1.54 | 1.87 | ||
Simulated power estimates across fixed values of the true odds and geocoding method.
| 5.6 | 74.9 | 93.2 | 100.0 | 100 | ||
| 5.0 | 65.7 | 87.0 | 100.0 | 100 | ||
| 4.9 | 37.3 | 57.5 | 99.8 | 100 | ||
| 5.2 | 40.7 | 62.1 | 99.9 | 100 | ||
| 5.1 | 44.5 | 66.1 | 99.9 | 100 | ||
| 4.8 | 43.7 | 65.9 | 99.9 | 100 | ||
| 5.0 | 45.5 | 67.5 | 100.0 | 100 | ||
The relationships (odds) are recovered with almost no error across different sample sizes and geocoding processes (Table
Recovered odds ratios (true value is 1.2) by types of geocoding and number of people in a sample. Estimates and power are based on 10,000 simulated samples.
| 1.18 | 0.52–1.71 | 29.4 | ||
| 1.17 | 0.53–1.65 | 31.1 | ||
| 1.18 | 0.50–1.82 | 19.0 | ||
| 1.19 | 0.75–1.56 | 43.4 | ||
| 1.19 | 0.75–1.52 | 41.8 | ||
| 1.19 | 0.69–1.66 | 24.3 | ||
| 1.20 | 0.87–1.49 | 56.5 | ||
| 1.20 | 0.87–1.45 | 51.8 | ||
| 1.20 | 0.78–1.56 | 30.2 | ||
| 1.20 | 0.95–1.44 | 66.1 | ||
| 1.20 | 0.94–1.42 | 59.4 | ||
| 1.20 | 0.86–1.51 | 35.5 | ||
| 1.21 | 0.99–1.42 | 74.5 | ||
| 1.21 | 0.98–1.39 | 66.4 | ||
| 1.20 | 0.90–1.48 | 40.1 | ||
| 1.21 | 1.04–1.40 | 81.1 | ||
| 1.21 | 1.02–1.38 | 73.3 | ||
| 1.21 | 0.94–1.46 | 44.9 | ||
| 1.20 | 1.05–1.38 | 85.6 | ||
| 1.21 | 1.04–1.36 | 78.3 | ||
| 1.20 | 0.95–1.43 | 49.7 | ||
| 1.21 | 1.07–1.37 | 89.7 | ||
| 1.21 | 1.06–1.35 | 83.5 | ||
| 1.21 | 0.98–1.41 | 53.6 | ||
| 1.21 | 1.08–1.36 | 93.2 | ||
| 1.21 | 1.08–1.35 | 87.0 | ||
| 1.21 | 1.00–1.40 | 57.0 | ||
Recovered odds ratios by TIGER geocoding with variable offsets and number of people in a sample. Estimates and power are based on 10,000 simulated samples.
| 1.18 | 0.50–1.82 | 19.0 | ||
| 1.18 | 0.50–1.84 | 19.6 | ||
| 1.22 | 0.50–1.91 | 21.4 | ||
| 1.20 | 0.50–1.84 | 21.7 | ||
| 1.23 | 0.49–1.91 | 22.4 | ||
| 1.19 | 0.69–1.66 | 24.3 | ||
| 1.19 | 0.68–1.67 | 26.3 | ||
| 1.24 | 0.68–1.74 | 28.9 | ||
| 1.22 | 0.69–1.68 | 28.3 | ||
| 1.24 | 0.68–1.74 | 28.8 | ||
| 1.20 | 0.78–1.56 | 30.2 | ||
| 1.20 | 0.79–1.58 | 33.1 | ||
| 1.24 | 0.79–1.66 | 35.9 | ||
| 1.22 | 0.80–1.60 | 35.6 | ||
| 1.25 | 0.79–1.65 | 36.2 | ||
| 1.20 | 0.86–1.51 | 35.5 | ||
| 1.20 | 0.86–1.52 | 38.7 | ||
| 1.25 | 0.86–1.60 | 41.7 | ||
| 1.23 | 0.87–1.55 | 41.2 | ||
| 1.25 | 0.86–1.60 | 42.7 | ||
| 1.20 | 0.90–1.48 | 40.1 | ||
| 1.21 | 0.90–1.49 | 44.0 | ||
| 1.25 | 0.92–1.56 | 47.5 | ||
| 1.23 | 0.92–1.51 | 46.9 | ||
| 1.25 | 0.92–1.56 | 48.1 | ||
| 1.21 | 0.94–1.46 | 44.9 | ||
| 1.21 | 0.94–1.46 | 49.1 | ||
| 1.25 | 0.96–1.54 | 53.2 | ||
| 1.23 | 0.96–1.49 | 52.9 | ||
| 1.26 | 0.96–1.54 | 53.8 | ||
| 1.20 | 0.95–1.43 | 49.7 | ||
| 1.21 | 0.96–1.44 | 53.4 | ||
| 1.25 | 0.97–1.51 | 57.3 | ||
| 1.23 | 0.97–1.47 | 56.6 | ||
| 1.25 | 0.98–1.51 | 58.0 | ||
| 1.21 | 0.98–1.41 | 53.6 | ||
| 1.21 | 0.98–1.42 | 58.3 | ||
| 1.25 | 1.00–1.49 | 62.8 | ||
| 1.23 | 1.00–1.45 | 62.4 | ||
| 1.26 | 1.01–1.49 | 63.4 | ||
| 1.21 | 1.00–1.40 | 57.5 | ||
| 1.21 | 1.00–1.41 | 62.1 | ||
| 1.25 | 1.02–1.48 | 66.1 | ||
| 1.24 | 1.02–1.44 | 65.9 | ||
| 1.26 | 1.03–1.48 | 67.5 | ||
Estimated power as a function of sample size and geocoding method.
This paper investigated the degree to which the recovery of a known relationship between environmental exposure and health is affected by the geocoding quality of the subjects of the research. Power analyses showed that the quality associated with different geocoding processes affected the ability to recover the relationships. As with all power analyses, the size of the sample, the variability in the contaminant surface, and the location of the sample in relation to this surface also affected the ability to recover the relationship. Because state or local regulations often control the locations of CAFOs relative to the residences of people, the number of people living in areas of high exposure to CAFO contaminants is limited, which in turn limits the ability to detect health effects in natural experiments as in this research [
The methods used in this paper can be adapted to other situations where the effect of environmental contaminants on health is the subject of study. Because linked social-spatial data [
Our results suggest that studies of relationships between environmental contaminants and health may be better designed by using spatial sampling procedures that identify locations of residences that equalize the number of subjects for different estimated levels of the contaminant load. Random samples of subjects are unlikely to have such characteristics and power analyses based on such samples will be less effective. With the widespread availability in the U.S. and elsewhere of E-911 or similar master address lists, and the availability as in this study of spatially modelled contaminant surfaces, determining such spatially stratified random samples that parsimoniously identify respondent locations will improve the quality of analyses of effects of contaminants on health.
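One way to realize such a spatially stratified sample is to bin residences by their estimated contaminant level and draw the same number from each bin, so that the scarce high-exposure residences are not swamped by the many low-exposure ones. The sketch below is a hedged illustration: equal-width strata, the stratum count, and the per-stratum quota are assumptions made here, and `stratified_sample` is a hypothetical helper rather than a procedure from the study.

```python
import random


def stratified_sample(z, n_strata, per_stratum, seed=0):
    """Draw up to per_stratum address indices from each equal-width exposure stratum."""
    rng = random.Random(seed)
    lo, hi = min(z), max(z)
    width = (hi - lo) / n_strata or 1.0  # guard against a constant surface
    strata = [[] for _ in range(n_strata)]
    for i, zi in enumerate(z):
        k = min(int((zi - lo) / width), n_strata - 1)  # top edge falls in last stratum
        strata[k].append(i)
    chosen = []
    for members in strata:
        chosen.extend(rng.sample(members, min(per_stratum, len(members))))
    return sorted(chosen)


# Hypothetical skewed exposure surface: most addresses have low contaminant values.
z_demo = [10.0 * (i / 199) ** 3 for i in range(200)]
sample = stratified_sample(z_demo, n_strata=4, per_stratum=10)
high = sum(1 for i in sample if z_demo[i] >= 7.5)
print(f"{len(sample)} addresses sampled, {high} from the highest exposure stratum")
```

A simple random sample of the same size would be expected to include only a handful of addresses from the top stratum, which is the imbalance the stratified design is meant to remove.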
A common problem faced by researchers of this subject is that they cannot know
Analyses to predict the ability to detect relationships between contaminant values at given locations and health will generally need to incorporate known demographic covariates that are also predictive of a health effect. Power analyses can be designed to incorporate covariates. A recurring question in geographic information science is whether particular geospatial databases are sufficiently accurate for the purpose for which they are used. Determining "fitness-for-use" of a geospatial data set is difficult and has been the subject of research in GIScience [
Although spatial databases are becoming more accurate as GIS technology improves and efforts are made to improve the accuracy of geographic base maps, it is accepted that no single level of accuracy will meet the requirements of every purpose for which spatial data is used. For each use, there are accuracy requirements and the question we asked is which of three widely used measures of location is adequate for the purpose of assessing whether a relationship exists between exposure to environmental contaminants and health. While research in geocoding accuracy and environmental health problems has often focussed on the effect of inaccuracies on an observed prevalence or relationships [
An experimental method to investigate the effect of geocoding accuracy is proposed in this paper. The method of accuracy assessment takes into consideration the 'purpose of use' of the geocodes in an environmental health context. Since a goal of such research is to examine relationships between health and exposure, the proposed method focuses on estimation of disease risk in the presence of modelling errors introduced through geocoding inaccuracies. We examine three widely used geocoding processes. Health data are simulated using known odds from exposure to a contaminant, with the contaminant values calculated at a gold standard geocode. These odds are then estimated using contaminant values calculated at the two other geocodes. All three geocoding processes studied were able to recover the simulated odds, though the strength of the relationship varied from process to process. In these analyses E-911 geocoding was superior to TIGER geocoding (with and without offset). More research is required to decide on an 'optimal geocode', since we have not evaluated all possible offsets of TIGER geocoding, E-911 with offsets, or other geocoding processes such as GPS-based or parcel-based geocoding. Sensitivity analyses show that the model is relatively robust at recovering the simulated odds. While the specific results obtained in this research may not generalize to other situations, the method can be applied in any situation where geocoding accuracy is in question in an environmental epidemiological study. Our research extends the literature on geocoding quality analysis by placing it in the context of decision making in environmental epidemiological studies.
The author(s) declare that they have no competing interests.
SM proposed the experimental design for the paper, conducted the analyses and wrote sections of the paper. GR oversaw the GIS part of the research and wrote sections of the paper. BJS wrote the program for the simulation (in R) which SM modified for this research and extensively reviewed the paper. DZ reviewed statistical details of the paper and revised the statistical sections. KJD directs the research project to which this paper contributes. He wrote sections of the paper.
Work on this research was supported by CDC Grant Number 3 R01 EH000056-01S1 with the Iowa Department of Public Health (IDPH) and Contract Number 5886CAR02 between IDPH and the University of Iowa, Kelley J Donham, Principal Investigator. The MATLAB CAFO plume model was written by Patrick T. O'Shaughnessy, Associate Professor at the Department of Occupational & Environmental Health, University of Iowa. The views expressed are solely those of the authors and do not represent the views of the funding agencies or of the IDPH. The authors are grateful to five anonymous reviewers whose comments helped improve this paper.