Statistical criteria are needed for evaluating the potential success or failure of applications of small area estimation. A necessary step toward this is a protocol (a series of steps) by which to assess whether an instance of small area estimation has given satisfactory results or not. Most customary attempts at evaluating small area techniques have deficiencies. Often, evaluation is not attempted at all. Every small area study requires an

Small area estimation is employed worldwide in many important applications, for example in determining the allocation of funds. It has a long history and a rich literature, with a variety of ingenious techniques and well-developed theory.

In 1979, there was a conference on synthetic estimation, the prominent small area estimation technique of the day. At its end, Richard Royall, the pivotal figure in present-day model-based sampling theory, issued a warning which, with a slight modification of terms, might still be applied to present-day conferences on small area estimation:

‘A workshop of this sort, focused on a specific technique, can spur development, but it can also be dangerous. The danger is that, from hearing many people speak many words about [small area estimation], we become comfortable with the technique. The idea and the jargon become familiar, and it is easy to accept that “Since all these people are studying [small area estimation], it must be okay.” We must remain skeptical and not allow familiarity to dull our healthy skepticism.’ (

Why would someone, this author included, who thinks the proper understanding of survey sample inference lies in the proper use of models, hesitate over small area estimation, a procedure resting as it does on the sophisticated use of models? We will suggest an answer in the succeeding section.

The enterprise of small area estimation arises because of the collision of two factors:

There is a general tacit assumption that these two factors are reconcilable; that while we would

Here is a thought experiment. Suppose (a) the

Surely there is a limit to how far this cycle could go on. If the resources were to dry up to

We do not attempt in this paper to answer this question. To answer it, we need to be able in general to evaluate small area projects and to have gained considerable experience in such evaluation. For the most part, our current experience in evaluation of small area estimation is inadequate. ‘The main limitation of small area methods … has been the difficulty in validating a particular approach for a given … problem. Standard approaches … are not useful … do not adequately answer the question of how well these methods work compared to … a large sample survey in each locality’. (

The problem is exacerbated by the very circumstance that drives us to small area estimation: many, if not most, of the areas of interest are under-sampled or not sampled at all. For example,

The ‘gold standard’ of evaluation has been evaluation of results against large external data sets derived from censuses or administrative data (cf.

A variety of other evaluation procedures have been used over the years, each having some weakness: (1) that the small area estimates are

The key problem is the lack of data precisely where they are needed to verify the validity of assumptions (for example in the 2 307 small areas lacking any sample in

We should perhaps stress that we are here addressing the situation where small area estimation is

We want to keep in mind the twin goals of the survey sample enterprise, which are the same as statistical estimation in general: (a) sharp accuracy (efficiency) and (b) sound inference. Accuracy: how close is the small area estimate to its target? Inference: does a confidence interval or its equivalent, derived in small area estimation typically from an estimate of mean square error, actually cover the target in accord with its stated coverage?

Both accuracy and inference strongly suggest the need for an

Having an external basis of comparison does not necessarily mean an external

Every survey should carry with it a validation sample S_A of these small areas, with a particular focus on those that will be weakly (or not at all) sampled in the main survey; each area a in S_A has a sample s_a taken within it.

Let an appropriate sample S_A of n_A areas be drawn; ‘appropriate’ may mean, for example, simple random sampling. Within each area a ∈ S_A, a supplementary sample s_a will be taken of size n_a, where n_a is large enough that the direct estimates based on s_a can be regarded as normally distributed, with variances well estimated and not large. The direct estimates and variance estimates will then be available for shedding light on the corresponding small area estimates derived from the main sample.

This validation sample S_A supplies, for each of its areas a, a sample s_a intended to give accurate direct estimates for the areas in S_A, quite independently of any data from the main survey. We refer to S_A as the validation sample and to the direct estimates based on the s_a’s as the validation values.

One can finish by incorporating the data from S_A and getting a revised set of estimates for both small areas and

The data from S_A can be used to produce measures that evaluate small area estimates (including their mean square error and interval estimates) and to provide diagnostic clues if there are indications of faulty estimation. Some information may be gained by graphing small area estimates against corresponding direct estimates for the areas in S_A. We can get formal measures by computing summary statistics across S_A (or suitable partitions of S_A) on relative biases and relative absolute biases, and by comparing small area estimates of mean square error to the squared differences between small area and direct estimates. It is important also to evaluate the confidence level of small area confidence intervals. We give details on possible approaches in the succeeding text.
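As a concrete sketch, the summary measures just described might be computed as follows. The array names and the exact form of the mean-square comparison are our own illustrative choices, not prescribed by the text:

```python
import numpy as np

def validation_diagnostics(theta_hat, mse_hat, y, v, z=1.959964):
    """Summary diagnostics comparing small area estimates (theta_hat, with
    estimated mean square errors mse_hat) to validation direct estimates y
    (with variances v) over the areas of the validation sample."""
    theta_hat, mse_hat = np.asarray(theta_hat), np.asarray(mse_hat)
    y, v = np.asarray(y), np.asarray(v)
    rel_bias = 100 * np.mean((theta_hat - y) / y)            # % relative bias
    rel_abs_bias = 100 * np.mean(np.abs(theta_hat - y) / y)  # % relative absolute bias
    # E(theta_hat - y)^2 = mse + v, so subtract v before comparing with mse_hat
    mse_ratio = np.mean(mse_hat) / np.mean((theta_hat - y) ** 2 - v)
    t_diff = (theta_hat - y) / np.sqrt(mse_hat + v)          # t_diff,a
    coverage = 100 * np.mean(np.abs(t_diff) <= z)            # % within +/- z
    return {"rel_bias_pct": rel_bias, "rel_abs_bias_pct": rel_abs_bias,
            "mse_ratio": mse_ratio, "coverage_pct": coverage}
```

The subtraction of v in the mean-square comparison reflects that the squared difference between a small area estimate and a direct estimate overstates the mean square error by the direct estimate's own variance.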

Our list of techniques is meant to be suggestive, not exhaustive.

Suppose the θ_a are the targets (truth) for the areas a ∈ S_A. Let θ̂_a denote the small area estimate of θ_a, with accompanying estimate mse_a of its mean square error, and let y_a denote the direct estimate from s_a, with estimated variance v_a.

If the small area estimation is working as hoped, then the average (mean) across S_A of the relative biases (θ̂_a − θ_a)/θ_a should be near zero. (Here we suppose S_A has been divided into groups S_Ag that we believe to have internal mean squared error homogeneity.) These quantities, and the true confidence level of confidence intervals, depend on unknowns and cannot be calculated from the sample. The nominal 1 − α intervals take the form θ̂_a ± z_{1−α/2}√mse_a, where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution. Such intervals (based on estimated mean square error rather than variance) tend to be conservative, covering θ_a with at least the nominal coverage level, provided the estimate of mean square error is on target.

We look to the validation sample to provide ‘mirrors’ (indirect information) on the aforementioned quantities and on confidence levels.

The relative bias is assayed by the average across S_A of (θ̂_a − y_a)/y_a, the validation values y_a standing in for the unknown targets θ_a.

The confidence interval for θ_a contains θ_a if and only if t_a ≡ (θ̂_a − θ_a)/√mse_a lies in [−z_{1−α/2}, z_{1−α/2}], so if we could calculate t_a, then we could appraise the coverage by looking at the distribution of the t_a’s across areas. But t_a is inaccessible, because θ_a is unknown. Instead, we can, for a ∈ S_A, calculate t_diff,a ≡ (θ̂_a − y_a)/√(mse_a + v_a). If the small area estimation is behaving well, t_diff,a should be a good indicator of the behaviour of t_a (for more discussion, see Appendix B). We can examine t_diff,a by looking at its values across the areas of S_A, and this can provide a window into the behaviour of t_a.

A summary of the t_diff,a is the coverage p_cov = P(∣t_diff,a∣ ≤ z_{1−α/2}), which can be taken as the average (mean) of I(∣t_diff,a∣ ≤ z_{1−α/2}) over all areas of concern (e.g. Group 1 in the example later). If this is seriously less than the nominal, it will arouse concerns about the actual behaviour of t_a. However, p_cov itself is inaccessible, because we only have a sample S_A from the areas of concern. We must rely on an estimate of coverage based on S_A, and on the binomial variability of that estimate around p_cov. For example, suppose n_A = 60 and p_cov = 95%; then, treating coverage events as independent across areas, the number of covered areas follows a binomial(60, 0.95) distribution, and correspondingly for p_cov = 99%, so the probability of any given observed coverage can be read off the appropriate binomial distribution.
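To get a feel for this binomial variability, exact tail probabilities can be tabulated with the standard library; the 90% threshold below is purely illustrative:

```python
from math import comb

def binom_pmf(n, p, k):
    """Exact binomial probability of k covered areas out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_observed_coverage_at_most(n, p, c):
    """P(observed coverage <= c) when each of n validation areas is
    covered independently with true probability p."""
    return sum(binom_pmf(n, p, k) for k in range(int(c * n) + 1))

# With n_A = 60: how often would we see observed coverage of 90% or less?
low_if_95 = prob_observed_coverage_at_most(60, 0.95, 0.90)  # true coverage 95%
low_if_99 = prob_observed_coverage_at_most(60, 0.99, 0.90)  # true coverage 99%
```

An observed coverage well below 90% is thus far more compatible with a true coverage at or below 95% than with one of 99%.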

We will consider variants of the basic area-level model

θ_a = μ_a + v_a.

Here, v_a ~ N(0, σ_a²) is a stochastic component, with σ_a typically assumed unknown, and μ_a is a fixed unknown. In the case of the Lahiri-Rao population, these components are assumed constant across areas: μ_a = μ and σ_a = σ. The direct estimate satisfies y_a = θ_a + e_a ≡ μ_a + v_a + e_a, with e_a ~ N(0, ψ_a) the sampling error, and v_a, e_a independent of each other and across areas.

The sampling variances ψ_a are typically assumed known in applications; we treat the ψ_a as known in this paper.

Then we have the estimates

θ̂_a = γ̂_a y_a + (1 − γ̂_a) μ̂, with γ̂_a = σ̂²/(σ̂² + ψ_a),

where μ̂ and σ̂² are estimates of μ and σ² from the main sample.

This is the original Fay-Herriot estimator of θ_a.

We will use estimates of the mean square errors of the θ̂_a.

We can form confidence intervals θ̂_a ± z_{1−α/2}√mse_a, intended to cover θ_a at least (1 − α) of the time.
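Under the Lahiri-Rao working model (common μ and σ across areas), the estimator and its leading mse term can be sketched as follows. The simple moment estimator of σ² used here is one standard choice, offered only as an illustration, not necessarily the paper's exact estimator:

```python
import numpy as np

def lahiri_rao_fh(y, psi):
    """Composite (Fay-Herriot-type) estimates under the working model
    y_a = mu + v_a + e_a, v_a ~ N(0, sigma2), e_a ~ N(0, psi_a),
    with the psi_a treated as known."""
    y = np.asarray(y, dtype=float)
    psi = np.asarray(psi, dtype=float)
    mu_hat = y.mean()  # simple choice; precision-weighted means are also used
    # crude moment estimator: sample variance of y minus average sampling variance
    sigma2_hat = max(0.0, np.var(y, ddof=1) - psi.mean())
    gamma = sigma2_hat / (sigma2_hat + psi)          # shrinkage weights in [0, 1]
    theta_hat = gamma * y + (1 - gamma) * mu_hat     # composite estimate
    mse_hat = gamma * psi                            # leading term (g1) only
    return theta_hat, mse_hat
```

The shrinkage behaves as expected at the extremes: with enormous sampling variance the estimates collapse to the overall mean, and with tiny sampling variance they reproduce the direct estimates.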

In the

In our case, the sample variances within the five groups are taken to be

We will consider four populations. In all cases, the number of areas in the five groups will be n_g = (1 200, 800, 500, 400, 100). From each population, we take a single sample, with sampling variance ψ_a depending on which group area a belongs to.

_{a}).

_{a} = _{a}

In all cases, we took the Lahiri-Rao model as the working model and employed the estimates for area means and for mean square error given earlier. Thus, we expect things to work well in Population 1 and possibly to misbehave in the other three populations. The question is how well our proposed diagnostics, employing data from the validation sample, reflect the underlying actual behaviour of the small area point estimates, their corresponding estimates of mean square error and the confidence intervals constructed from these.

We emphasize that none of the earlier results would be known to the analyst, because they all require knowledge of the unknown θ_a’s.

We take a single simple random sample S_A of 60 areas from Group 1 of each of the populations. For each of the areas a ∈ S_A, we take a sample having variance v_a = 0.4 (intermediate between the sampling intensities in Groups 4 and 5). For each of the selected areas, we calculate the diagnostics described in

The results reflect the hidden reality: they show t_diff,a for each of the 60 sampled areas. Ideally, most of the values will be spread between −2 and 2, getting sparser away from 0. This holds for Population 1. Population 2 sees a greater spread, and the indication of problems is very clear for Populations 3 and 4.

The results in

For each population, we take 500 validation samples of size n_A = 60 in Group 1 using simple random sampling. Local samples are taken with variance equal to v_a = 0.4. For each run, summary statistics are calculated as in
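The mechanics of this repeated-sampling exercise can be mimicked in a few lines. The population below is a stand-in of our own devising (true values, small area outputs and mse estimates are all synthetic, with the mse correctly specified), meant only to show the loop of drawing 500 validation samples and collecting the coverage diagnostic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for 1 200 unsampled Group 1 areas: true values,
# small area estimates, and (here, correctly specified) mse estimates.
theta = rng.normal(10.0, 1.0, size=1200)
theta_hat = theta + rng.normal(0.0, 0.3, size=1200)
mse_hat = np.full(1200, 0.3 ** 2)
v_a = 0.4  # validation variance per area, as in the text

def coverage_diagnostic():
    idx = rng.choice(1200, size=60, replace=False)   # SRS of n_A = 60 areas
    y = rng.normal(theta[idx], np.sqrt(v_a))         # validation estimates
    t_diff = (theta_hat[idx] - y) / np.sqrt(mse_hat[idx] + v_a)
    return 100 * np.mean(np.abs(t_diff) <= 1.959964)

coverages = [coverage_diagnostic() for _ in range(500)]
```

Because the mse estimates here are correct, the 500 coverage values should cluster around 95%; substituting mse estimates that are too small would push the whole distribution down.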

In the main, the sort of indications that our single sample gave hold up across the runs.

In Population 1, none of the samples suggests anything seriously amiss with respect to bias. There is one isolated sample, with coverage around 85%, that might make us question our small area inferences. The mean square ratio seems the least stable of our indicators, with a fair portion of samples suggesting that the mean square error estimator is too small. The 95% coverage of

In Population 2, there are one or two samples that might suggest inference is okay, but by and large, the coverages reflect well that our small area inferences are doing poorly. The mean square diagnostic points in the same direction, but there is considerable overlap with what was seen for Population 1. For the bias diagnostics also, a large number of samples would not clearly distinguish between a Population 1 and a Population 2 situation.

Thus, there is a suggestion that the

Population 3 is unambiguous on all four diagnostics: relative bias is consistently negative, the estimated mean square error is consistently low, and the coverage gives a clear warning signal in all runs.

In Population 4, the diagnostics across runs mirror the mixed picture we saw in the population (

Current practice in small area estimation leaves us vulnerable to using very elegant and persuasive techniques while remaining in the dark as to whether they are actually working in the particular survey to which they are applied. This is a serious matter, especially because small area estimates are often used to make judgments on funding and other matters important to the body politic.

Although sporadic attempts at validation are made, they are often flawed, relying themselves on judgments that embody assumptions and speculations, as described in

In this paper, we have suggested that every small area estimation project should carry with it means for checking validity in the form of an independent sample of areas that ordinarily go sparsely sampled or unsampled and so institute REEP, a Routine External Evaluation Protocol.

The data gathered from appropriately selected small areas can in the end be incorporated into overall estimates, having served their main purpose of validating the small area estimates (Note 3 earlier).

But what if the diagnostics indicated that the model was not adequate? Should we give up on doing SAE for the problem? Not necessarily. The first step would be to try an alternative model suggested by the results of the validation study, perhaps incorporating the data from S_A into the model to get final estimates. If no adequate alternative can be found, then we might have to acknowledge that, in the present instance, small area estimation is failing.

The illustrative examples in

The major questions facing us in putting REEP into practice are (1) how many small areas need to be sampled in our validation sample? (2) how heavily must each area in the sample be sampled? (3) what diagnostics based on the supplementary data will be illuminating?

(1) Taking samples of 60 areas worked pretty well in the artificial populations of this paper. Taking more will give greater precision in summary diagnostics. It is desirable that this question be explored further in a variety of practical settings. A particular concern will be to limit false negatives, for example low coverage according to t_diff,a when the true coverage actually matches the nominal. See

Our criterion for (2) is that the areas entering into the supplementary sample should be sampled heavily enough that estimates based on the data within an area will be precise and reasonably assumed to follow a normal distribution. In the present paper, we took samples that were intermediate between those most heavily sampled in the main survey and those sampled more moderately. Again, it will be worthwhile to explore how various choices in this regard play out in practical settings.

Criterion (2) has somewhat greater importance than (1). We might still be able to learn a good deal if the number of areas sampled is lessened, but if the samples from the areas within the validation sample are too small, our measures cannot be expected to be satisfactory.

(3) We explored various diagnostics dependent on the small area estimates and the estimates from the supplementary sample. Perhaps the most useful of these, as verifying (or not) our inferences, is t_diff,a. We anticipate that additional measures will be developed down the road;

We have not discussed many small area methods, for example Bayesian methods and quantile approaches, where doubtless some modification to the diagnostics we have suggested will be in order. But the basic REEP idea should apply to them.

Routine External Evaluation Protocol is analogous to quality control in industrial production. It carries a cost, of course, one to which survey administrators may be reluctant to agree. At bottom, the cost is some sacrifice of efficiency in upper-level estimates and in areas that are typically heavily sampled. Precedent for such sacrifice is testified to by the several papers cited in

Let T be an estimator with bias B and standard deviation σ, so that M² ≡ σ² + B² represents the mean square error of T.

For (a), we seek

For (b), we want _{1−α,conv} and _{1−α,mse}

The basic message is: if we properly take the mean square error into account, coverage becomes more and more conservative as the bias increases and as the variance of the unbiased estimator shrinks. If we improperly aim only at the variance, coverage gets weaker and weaker with larger bias and smaller σ.

In the small area estimation context of this paper, we use neither variance nor mean square error, but rather an estimate of the mean square error. If the estimate is on target, we would be as in Table B2,

If σ is not large, we should get a very good picture of how well the combination of our estimate and its accompanying mean square error estimate is doing. The two tables are not extremes: the estimate of mean square error can be larger than the mean square error or lower than the variance; nonetheless, these tables serve as guideposts and give us an idea of what to expect.
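The entries of Tables B1 and B2 can be reproduced under the reading (our inference from the captions) that the interval is for the difference between T (bias B, standard deviation σ) and an independent unbiased variable of unit standard deviation:

```python
from math import erf, sqrt

def Phi(x):
    # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

Z = 1.959964  # z_{.975}

def cov_conventional(sigma, B):
    """Coverage (%) of the conventional interval with half-width
    Z * sqrt(sigma^2 + 1), which ignores the bias B (Table B1)."""
    s = sqrt(sigma ** 2 + 1)  # sd of the difference
    return 100 * (Phi(Z - B / s) - Phi(-Z - B / s))

def cov_mse(sigma, B):
    """Coverage (%) when the half-width instead uses the root mean
    square error sqrt(sigma^2 + B^2 + 1) (Table B2)."""
    s = sqrt(sigma ** 2 + 1)
    h = Z * sqrt(sigma ** 2 + B ** 2 + 1)
    return 100 * (Phi((h - B) / s) - Phi((-h - B) / s))
```

For example, cov_conventional(0.1, 1) returns about 83.11 and cov_mse(0.1, 1) about 96.15, matching the (σ = 0.1, B = 1) cells of the two tables.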

Coverage probability arising from a conventional confidence interval, for difference of variables.

σ \ B | 0.1 | 0.2 | 0.5 | 1 | 1.5 | 2 | 5 | 10
---|---|---|---|---|---|---|---|---
0.1 | 94.89 | 94.55 | 92.12 | 83.11 | 67.96 | 48.8 | 0.13 | 0
0.2 | 94.89 | 94.56 | 92.2 | 83.47 | 68.73 | 49.95 | 0.16 | 0
0.5 | 94.91 | 94.63 | 92.68 | 85.45 | 73.13 | 56.78 | 0.6 | 0
1 | 94.94 | 94.77 | 93.56 | 89.1 | 81.45 | 70.7 | 5.76 | 0
1.5 | 94.96 | 94.86 | 94.11 | 91.41 | 86.77 | 80.14 | 20.8 | 0.02
2 | 94.98 | 94.91 | 94.43 | 92.68 | 89.71 | 85.45 | 39.12 | 0.6
5 | 95 | 94.98 | 94.89 | 94.56 | 94 | 93.22 | 83.47 | 49.95
10 | 95 | 95 | 94.97 | 94.89 | 94.74 | 94.55 | 92.12 | 83.11

Coverage probability arising from a confidence interval based on mean square error, for difference of variables.

σ \ B | 0.1 | 0.2 | 0.5 | 1 | 1.5 | 2 | 5 | 10
---|---|---|---|---|---|---|---|---
0.1 | 95 | 95 | 95.1 | 96.15 | 97.88 | 99.12 | 100 | 100
0.2 | 95 | 95 | 95.1 | 96.11 | 97.81 | 99.07 | 100 | 100
0.5 | 95 | 95 | 95.07 | 95.84 | 97.37 | 98.71 | 100 | 100
1 | 95 | 95 | 95.03 | 95.39 | 96.37 | 97.62 | 99.99 | 100
1.5 | 95 | 95 | 95.01 | 95.16 | 95.67 | 96.54 | 99.87 | 100
2 | 95 | 95 | 95 | 95.07 | 95.32 | 95.84 | 99.48 | 100
5 | 95 | 95 | 95 | 95 | 95.01 | 95.04 | 96.11 | 99.07
10 | 95 | 95 | 95 | 95 | 95 | 95 | 95.1 | 96.15

Values of t_diff,a across a validation sample of 60 areas from each of 4 populations.

Population 1. Distributions of four diagnostics over 500 runs, each a sample of size n_A = 60.

Population 2. Distributions of four diagnostics over 500 runs, each a sample of size n_A = 60.

Population 3. Distributions of four diagnostics over 500 runs, each a sample of size n_A = 60.

Population 4. Distributions of four diagnostics over 500 runs, each a sample of size n_A = 60.

Frequency of counties having effective sample size in recent U.S. National Health Interview Survey.

Effective number of sampled units in area | 0 | (0,100) | [100,300) | [300,600] | (600,900] | >900
---|---|---|---|---|---|---
Frequency | 2 307 | 497 | 251 | 68 | 11 | 9

Quantiles of size variable x for population 4.

Minimum | 25% | 50% | 75% | Maximum
---|---|---|---|---
0.04 | 0.52 | 1 | 2.01 | 39.11

Summary statistics of small area estimates for 1 200 areas lacking sample in group 1 of 4 populations.

Population | % relative bias | % relative absolute bias | Mean estimated mse / mean mse | Nominal 95% coverage | Nominal 99% coverage
---|---|---|---|---|---
Pop1 | −0.1 | 4.99 | 1.08 | 95.84 | 99.09
Pop2 | 0.96 | 10.17 | 0.27 | 69.92 | 82.5
Pop3 | −11.25 | 11.34 | 0.21 | 48.92 | 74.25
Pop4 | 234.34 | 236.42 | 1.06 | 98.83 | 100

Summary statistics for small area statistics relative to validation values for a sample of 60 areas in group 1 in each of 4 populations.

Population | Diag % rel bias | Diag % rel abs bias | Diag mean estimated mse ratio | % with \|t_diff,a\| ≤ z_{.975} | % with \|t_diff,a\| ≤ z_{.995}
---|---|---|---|---|---
Pop1 | −0.51 | 5.41 | 1.02 | 91.67 | 98.33
Pop2 | 2.76 | 11.99 | 0.26 | 70 | 83.33
Pop3 | −10.84 | 10.84 | 0.29 | 63.33 | 86.67
Pop4 | 2 909.26 | 3 631.94 | 1.08 | 98.33 | 100