In this paper we propose a method for statistical disclosure limitation of categorical variables that we call Conditional Group Swapping. This approach is suitable for design and strata-defining variables, the cross-classification of which leads to the formation of important groups or subpopulations. These groups are important because, for data analysis, it is desirable to preserve the analytical characteristics within them. In general, data swapping can be quite distorting ([

Statistical agencies are obliged by law to protect the privacy and confidentiality of data subjects while preserving important analytical features in the data they provide. Privacy and confidentiality are not guaranteed by removing direct identifiers, such as names, addresses and social security numbers, from the microdata file. Re-identification of individuals is still possible by linking the file without direct identifiers to external databases. That is why, in addition to the removal of direct identifiers, released microdata are typically modified to make disclosure more difficult; that is, statistical disclosure limitation (SDL) methods are applied to the data prior to their release. The goal of such a modification is two-fold: to reduce the risk of re-identification and, at the same time, to preserve important distributional properties of the original microdata file. Although it is not possible to know all the uses of the data beforehand, some of the relationships of interest to the user may be known. For example, some surveys oversample particular groups of individuals with the goal of obtaining better estimates for these groups. This requires a special sample design and the allocation of additional funds to obtain bigger samples for these groups. It would be particularly undesirable and counterproductive if SDL methods significantly changed the estimates within these groups and/or considerably increased their standard errors. Every data release scenario is therefore different, and disclosure limitation methods should be chosen accordingly. In this paper, we focus on the data release situation in which the data protector has to modify categorical variables that define strata or subpopulations, but at the same time wants to minimize the distortion to the analytical structure within these strata.

To accomplish this task, the data protector can choose from a wide variety of methods, which can be divided into two groups: masking methods, which release a modified version of the original microdata, and synthetic methods, which generate synthetic records or values for specific variables from the distribution representing the original data.

A few examples of masking methods are: additive or multiplicative noise [

To measure the utility of masked data, the data protector can use either analysis-specific utility measures, tailored to specific analyses, or broad measures reflecting global differences between the distributions of original and the masked data [

First, let us recall the definition of a propensity score. The propensity score is the probability that an observation _{i}. We denote _{i} - the values of the variables for this record. Propensity scores can be estimated via a logistic regression of the variable _{i} given the propensity score (see [
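The role of propensity scores as a distance between two groups of records can be illustrated with a short sketch. It assumes the commonly used form of the propensity score measure: stack the two groups, estimate by logistic regression the probability that each record belongs to the first group, and average the squared deviations of these probabilities from the first group's share of the stacked data. The function names (`fit_logistic`, `propensity_distance`) and the plain gradient-ascent fit are illustrative assumptions, not the implementation used in the experiments.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, n_iter=500, lr=0.1):
    """Fit a main-effects logistic regression by plain gradient ascent.

    Illustrative only: assumes features on a moderate scale.
    Returns weights w, where w[0] is the intercept.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = yi - p
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


def propensity_distance(group_a, group_b):
    """Mean squared deviation of propensity scores from the share of group_a.

    ASSUMPTION: this is the commonly used propensity score measure; values
    near 0 mean the two groups are hard to tell apart.
    """
    X = group_a + group_b
    y = [1] * len(group_a) + [0] * len(group_b)
    w = fit_logistic(X, y)
    scores = [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]
    c = len(group_a) / len(X)
    return sum((p - c) ** 2 for p in scores) / len(X)
```

Under this assumed measure, two groups drawn from the same distribution give a value close to zero, while well-separated groups give a value approaching `c * (1 - c)`.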

In this paper, we focus on a non-synthetic approach to disclosure limitation suitable for categorical strata-defining variables, the cross-classification of which leads to the formation of groups that are important to a data analyst. We present the Conditional Group Swapping method, which is designed to minimize the distortion that swapping causes to the relationships between the variables, particularly those that involve categorical strata-defining variables. The idea of the method is described in

In this section we describe the algorithm of our Conditional Group Swapping approach, hereafter abbreviated as CGS. The main steps of the method are listed below.

1. Compute pairwise distances between all the strata using the propensity score metric (1) described in

2. Compute swapping probabilities, that is the probabilities of moving records from one stratum to another, for the records in two closest strata. This will be done as follows. Suppose the distance between stratum _{s} be the desired swapping rate, that is the number of records that will be moved from one stratum to another. To compute the swapping probabilities, first combine together all the records from _{i}. In other words we compute the propensity scores, denoting them as _{AB}(_{i}).

3. Select _{s} records from stratum _{AB}(_{i}) and change their stratum indicator to _{s} residential hospitals we will change their hospital type indicator to multi-service.

4. Select _{s} records from stratum _{AB}(_{i}) and “move” them to stratum

5. Repeat steps 3 and 4 for another pair of strata with the next closest distance.

6. Repeat step 5 until there are no strata that have not been swapped.
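The exchange of records between one pair of closest strata can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `scores[i]` is taken to be the estimated probability that record `i` belongs to the first stratum of the pair, and a record is assumed to be selected for moving with probability driven by how strongly it resembles the other stratum. The names `swap_pair` and `weighted_sample` are hypothetical. Setting `conditional=False` gives uniform selection, in which the values of a record play no role.

```python
import random


def weighted_sample(items, weights, k, rng):
    """Draw k distinct items, favouring large weights (exponential-clock method)."""
    keyed = []
    for item, w in zip(items, weights):
        w = max(w, 1e-12)  # guard against zero weights
        keyed.append((rng.expovariate(1.0) / w, item))
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed[:k]]


def swap_pair(labels, scores, stratum_a, stratum_b, n_swap, conditional=True, seed=None):
    """Move n_swap records from A to B and n_swap records from B to A.

    labels: stratum label of each record.
    scores: ASSUMED to be the estimated probability of belonging to stratum_a.
    Returns a new list of labels; stratum sizes are unchanged.
    """
    rng = random.Random(seed)
    idx_a = [i for i, s in enumerate(labels) if s == stratum_a]
    idx_b = [i for i, s in enumerate(labels) if s == stratum_b]
    if n_swap > min(len(idx_a), len(idx_b)):
        raise ValueError("n_swap exceeds the size of the smaller stratum")
    if conditional:
        # ASSUMPTION: records in A that look like B are more likely to move,
        # and records in B that look like A are more likely to move.
        move_a = weighted_sample(idx_a, [1.0 - scores[i] for i in idx_a], n_swap, rng)
        move_b = weighted_sample(idx_b, [scores[i] for i in idx_b], n_swap, rng)
    else:
        # Uniform selection: record values do not influence the choice.
        move_a = rng.sample(idx_a, n_swap)
        move_b = rng.sample(idx_b, n_swap)
    new_labels = list(labels)
    for i in move_a:
        new_labels[i] = stratum_b
    for i in move_b:
        new_labels[i] = stratum_a
    return new_labels
```

Because the same number of records moves in each direction, the stratum sizes are preserved and exactly `2 * n_swap` labels change. Applying the function to successive pairs of strata, ordered by distance, would mirror the repetition in the last two steps.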

The procedure described above was implemented and evaluated on several data sets. We experimented with genuine and simulated data. In this section we present only the results obtained on two genuine data sets. Simulated data results were very similar, so we omit them for brevity of the exposition. Below is the description of the two genuine data sets we used.

The Titanic data is a public data set that was obtained from the Kaggle web-site [

The 1998 Survey of Mental Health Organizations (abbreviated as SMHO). This sample contains 874 hospitals. It is publicly available and can be obtained from the PracTools R package [

We applied the approach described in

Based on these considerations, we divided the data into six strata according to the cross-classification of the variables Pclass and Sex: 1) first class males, 2) first class females, 3) second class males, 4) second class females, 5) third class males, 6) third class females.

The first step of the CGS procedure identified the following pairs of strata as closest: first class males and first class females, second class males and second class females, and third class males and third class females. To measure the distance between the distributions of different strata (specifically, between the multivariate distributions of Survived, Age, Fare, SibSp and Parch for each stratum), we used the following model to estimate propensity scores: the main effects for the variables Survived, Age, Fare, SibSp and Parch and the interactions between Survived and Fare, Survived and Age, Survived and SibSp, Survived and Parch. We did not include all the main terms and interactions because otherwise the totality of the estimated parameters would not be supported by the sample size.

Because the goal of our experiments is to test the potential benefits of using conditional probabilities for swapping, and more specifically to estimate the effect of such probabilities on the quality of different statistical estimates, we compared the outcome of Conditional Group Swapping to the outcome of a similar approach characterized by uniform swapping probabilities. For the latter approach, the values of the variables of the records do not influence the probabilities of these records being swapped. We call it Random Group Swapping, hereafter abbreviated as RGS. In a sense, RGS reflects the idea of the traditional approach to swapping. To make a fair comparison and to estimate the effect of using conditional swapping probabilities, RGS and CGS were implemented in the same way (as described in

We experimented with two swapping rates: 20 and 40 records exchanged between the strata. This corresponds respectively to about 15 and 35 percent of records swapped for each stratum. For each swapping rate, we generated 100 realizations of swapped data using Random and Conditional Group Swapping.

Next, we compared the results of several statistical analyses based on the original and swapped data. One of them was logistic regression fitted to the complete Titanic data with Survived as the predicted variable and Pclass, Sex and Age as predictors. Hereafter, we will use R notation for the models. For the aforementioned regression it will be: Survived ~ Pclass+Sex+Age. Denote this model Reg1. We used this set of predictors in Reg1 because they were identified as being statistically significant based on the original data.

We also fitted logistic regressions within each stratum: Survived ~ Age + Fare. Denote this model Reg2.

Next, we compared confidence intervals of regression coefficients for these regressions based on the original and swapped data. There were five regression coefficients for Reg1, including intercept, coefficient for Age, coefficients for dummy variables Pclass=2, Pclass=3 and for Sex=male and three regression coefficients for Reg2 (intercept and coefficients for Age and Fare).

As a measure of comparison we used the relative confidence interval overlap similar to the one used in [

Let $(L_{orig,k}, U_{orig,k})$ and $(L_{swap,k}, U_{swap,k})$ be the lower and upper bounds of the original and masked confidence intervals for the $k$-th coefficient. Let $L_{over,k} = \max(L_{orig,k}, L_{swap,k})$ and $U_{over,k} = \min(U_{orig,k}, U_{swap,k})$. When the original and masked confidence intervals overlap, $L_{over,k} < U_{over,k}$ and $(L_{over,k}, U_{over,k})$ represent the lower and the upper bounds of the overlapping region. When these confidence intervals do not overlap, $L_{over,k} \geq U_{over,k}$ and $(L_{over,k}, U_{over,k})$ represent the upper and the lower bounds of the non-overlapping region between these intervals. The measure of relative confidence interval overlap for the coefficient

When the confidence intervals overlap, the overlap measure lies in (0, 1], and it equals 1 when the intervals coincide exactly. When one of the confidence intervals is contained in the other, the measure captures this discrepancy and lies strictly between 0 and 1. When the intervals do not overlap, the measure is at most 0; in this case it measures the non-overlapping area (between the intervals) relative to their lengths. We also report an average confidence interval overlap over all the coefficients defined as

_{k} over all 100 realizations and all the coefficients. Range of variation is reported for the central 90% of the distribution of _{k}. Column “# non-over” displays the fraction of times the intervals didn’t overlap over all the realizations and coefficients. For example, 100_{k} < 0). For Reg1 the number of computed intervals is 500 = 100 realizations × 5 coefficients; and for Reg2 it is 1800 = 3 coefficients × 6 strata × 100 realizations
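The relative confidence interval overlap can be sketched in a few lines. The sketch assumes the commonly used form of the measure: the length of the overlapping region (negative when the intervals are disjoint) divided by the length of each interval, averaged over the two intervals. The average over coefficients is assumed to be a simple arithmetic mean. The function names `ci_overlap` and `average_overlap` are hypothetical.

```python
def ci_overlap(orig, swap):
    """Relative overlap of two confidence intervals given as (lower, upper).

    ASSUMPTION: average of the overlap length relative to each interval length.
    Equals 1 for identical intervals and is at most 0 for disjoint intervals.
    """
    l_orig, u_orig = orig
    l_swap, u_swap = swap
    l_over = max(l_orig, l_swap)
    u_over = min(u_orig, u_swap)
    width = u_over - l_over  # negative when the intervals do not overlap
    return 0.5 * (width / (u_orig - l_orig) + width / (u_swap - l_swap))


def average_overlap(orig_cis, swap_cis):
    """Simple mean of the per-coefficient overlaps (assumed form of the average)."""
    values = [ci_overlap(o, s) for o, s in zip(orig_cis, swap_cis)]
    return sum(values) / len(values)
```

For example, `ci_overlap((0, 2), (1, 3))` gives 0.5, `ci_overlap((0, 4), (1, 3))` gives 0.75 (one interval contained in the other), and `ci_overlap((0, 1), (2, 3))` gives -1.0 (disjoint intervals).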

As can be seen from the table, the average confidence interval overlap _{s} = 20 and 40 respectively). Moreover, the values of

Regarding the overlap measures for individual coefficients, we observed that they were similar in value across coefficients, except for the coefficient for Sex. In particular, the average overlap over 100 realizations was smaller for Sex than for the other coefficients (it was equal to 0.5 for CGS). Confidence intervals for Sex overlapped in all 100 realizations of CGS at the swapping rate of 20, whereas they never overlapped for RGS. There is an explanation for this: in both cases swapping was done between the strata that were identified as closest to each other. The closeness was estimated for the multivariate distribution of Survived, Age, Fare, SibSp and Parch. The closest strata happened to be the ones that have the same passenger class Pclass but different Sex,

In addition to confidence interval comparisons, we also computed the element-wise ratios of original and swapped data means and covariance matrices for numerical variables Age, Fare, SibSp and Parch within each stratum. The results of these comparisons are presented in
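The element-wise comparison of means and covariance matrices within a stratum can be sketched as below. The direction of the ratio (masked divided by original) is an assumption, as is the use of the unbiased sample covariance; the sketch also assumes that no original mean or covariance entry is exactly zero. The function names are hypothetical. A negative covariance ratio indicates a sign change between the original and masked data.

```python
def mean_vector(rows):
    """Column means of a list of equal-length numeric records."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Unbiased sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    d = len(m)
    return [
        [sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def elementwise_ratios(orig_rows, masked_rows):
    """Return (mean ratios, covariance ratios, number of covariance sign changes).

    ASSUMPTION: ratios are masked / original, and original entries are nonzero.
    """
    m0, m1 = mean_vector(orig_rows), mean_vector(masked_rows)
    c0, c1 = covariance_matrix(orig_rows), covariance_matrix(masked_rows)
    mean_ratio = [b / a for a, b in zip(m0, m1)]
    cov_ratio = [[b / a for a, b in zip(row0, row1)] for row0, row1 in zip(c0, c1)]
    sign_changes = sum(1 for row in cov_ratio for value in row if value < 0)
    return mean_ratio, cov_ratio, sign_changes
```

Ratios close to 1 with few sign changes indicate that the first and second moments of the numerical variables within the stratum are well preserved.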

For our second data set, SMHO, we fitted a logistic regression of Findirct (hospital receives money from the state mental health agency) on all other variables, denoted Reg3, and a regression of Exptotal (total expenditures in 1998) on all other variables, denoted Reg4. Both regressions were fitted to the complete data. Within strata, the analyses included the regressions of Findirct on all other variables (Reg5) and of Exptotal on all other variables (Reg6). Hospital type was not included in the predictor set of Reg5 or Reg6 because it has the same value for all the records in a particular stratum. The results are presented in

As can be seen from Tables

In this paper we presented a Conditional Group Swapping method suitable for categorical variables that define strata or subpopulations. The method is designed to reduce the damage that disclosure limitation causes to the relationships between the variables within the strata and in the overall data. Our experimental results showed that the method has the potential to preserve inferential properties, such as confidence intervals for the regression coefficients specific to particular strata and for the overall data, better than Random Group Swapping. For numerical variables, the means and covariance matrices within the strata are less distorted as well.

We believe that in practice CGS should not be the only method that is applied to the data, especially if there are continuous variables in the data. Similar to other swapping approaches, CGS can be used together with other SDL methods. For example, one can apply Conditional Group Swapping to strata-defining variables and then add multivariate noise

Another direction for future research is the investigation of the risk associated with the method. We believe that the risk assessment is more comprehensive and practically useful when done for the final version of the masked data, which, as we noted above, will result from the application of our Conditional Group Swapping together with other SDL methods. Indeed, if there are continuous variables in the data and they are not masked, then re-identification risk can be high regardless of the protection of categorical variables, because the values of continuous variables are virtually unique. The CGS method is not suited for continuous variables; however, as mentioned above, it can be used in combination with additive noise. We therefore carried out several experiments with this combination; in particular, we applied it to both our data sets. The value

Next, we estimated the re-identification disclosure risk, defined as an average percentage of correctly identified records when record linkage techniques [
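The idea of measuring re-identification risk as a percentage of correctly linked records can be illustrated with a simple distance-based linkage. This sketch is a generic nearest-neighbour linkage chosen for illustration; it is not necessarily the record linkage technique used in the experiments. It assumes that `masked[i]` is the masked version of `original[i]`, that all variables are numeric and on comparable scales, and the function name `reidentification_rate` is hypothetical.

```python
import math


def reidentification_rate(original, masked):
    """Percentage of masked records whose nearest original record is their source.

    ASSUMPTION: Euclidean distance on numeric variables; masked[i] was derived
    from original[i]. Ties are resolved in favour of the lowest index.
    """
    correct = 0
    for i, record in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: math.dist(record, original[j]))
        if nearest == i:
            correct += 1
    return 100.0 * correct / len(masked)
```

Averaging this percentage over several realizations of the masked data would give a risk figure of the kind described above.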

The re-identification disclosure risk for the Titanic data masked with multivariate noise and CGS was low: about 4% of all records were correctly identified for a swapping rate of 20 and about 3% for a swapping rate of 40. For the SMHO data the risk was even lower: about 2% for a swapping rate of 20 and 1.5% for a swapping rate of 40.

As we mentioned above, these experiments do not represent a comprehensive risk analysis; however, they give an idea of the magnitude of the risk. A thorough investigation of the disclosure risk for the combination of Conditional Group Swapping with different SDL methods is the topic of our future research.

The authors would like to thank Alan Dorfman and Van Parsons for valuable suggestions and help during the preparation of the paper. The findings and conclusions in this paper are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

The Titanic data results: original and masked confidence interval overlaps.

Model | Rate | CGS Average | CGS Range | CGS # non-over | RGS Average | RGS Range | RGS # non-over |
---|---|---|---|---|---|---|---|
Reg1 | 20 | 0.88 | [0.6, 0.99] | 0/500 | 0.52 | [−0.53, 0.95] | 100/500 |
Reg1 | 40 | 0.65 | [−0.15, 0.96] | 51/500 | 0.16 | [−1.86, 0.92] | 106/500 |
Reg2 | 20 | 0.85 | [0.60, 0.98] | 1/1800 | 0.76 | [0.42, 0.97] | 1/1800 |
Reg2 | 40 | 0.79 | [0.41, 0.97] | 0/1800 | 0.69 | [0.28, 0.95] | 3/1800 |

The Titanic data results: ratios of means and ratios of covariance matrices based on the original and masked data.

Statistic | Rate | CGS Average | CGS Range | CGS # sign change | RGS Average | RGS Range | RGS # sign change |
---|---|---|---|---|---|---|---|
Mean ratio | 20 | 1.003 | [0.91, 1.10] | N/A | 1.002 | [0.81, 1.21] | N/A |
Mean ratio | 40 | 1.004 | [0.87, 1.15] | N/A | 1.02 | [0.72, 1.37] | N/A |
Cov. ratio | 20 | 1.02 | [0.61, 1.67] | 208/9600 | 1.03 | [0.51, 1.62] | 238/9600 |
Cov. ratio | 40 | 1.5 | [0.52, 1.68] | 238/9600 | 0.99 | [0.34, 1.68] | 364/9600 |

The SMHO data results: original and masked confidence interval overlaps.

Model | Rate | Conditional Swap Average | Conditional Swap Range | Conditional Swap # non-over | Random Swap Average | Random Swap Range | Random Swap # non-over |
---|---|---|---|---|---|---|---|
Reg3 | 20 | 0.91 | [0.72, 0.99] | 0/900 | 0.72 | [0.2, 0.99] | 2/900 |
Reg3 | 40 | 0.84 | [0.84, 0.99] | 0/900 | 0.64 | [−0.29, 0.97] | 100/900 |
Reg4 | 20 | 0.94 | [0.81, 1] | 0/900 | 0.84 | [0.58, 0.97] | 0/900 |
Reg4 | 40 | 0.92 | [0.77, 0.99] | 0/900 | 0.71 | [0.26, 0.95] | 11/900 |
Reg5 | 20 | 0.85 | [0.56, 0.99] | 5/2500 | 0.64 | [−0.25, 0.97] | 196/2500 |
Reg5 | 40 | 0.82 | [0.45, 0.98] | 14/2500 | 0.47 | [−0.7, 0.95] | 393/2500 |
Reg6 | 20 | 0.81 | [0.45, 0.98] | 0/2500 | 0.72 | [0.26, 0.96] | 25/2500 |
Reg6 | 40 | 0.73 | [0.15, 0.97] | 71/2500 | 0.61 | [−0.06, 0.94] | 158/2500 |

The SMHO data results: ratios of means and covariance matrices based on the original and masked data.

Statistic | Rate | Conditional Swap Average | Conditional Swap Range | Conditional Swap # sign change | Random Swap Average | Random Swap Range | Random Swap # sign change |
---|---|---|---|---|---|---|---|
Mean ratio | 20 | 0.99 | [0.85, 1.09] | N/A | 0.94 | [0.56, 1.17] | N/A |
Mean ratio | 40 | 0.96 | [0.63, 1.12] | N/A | 0.92 | [0.38, 1.32] | N/A |
Cov. ratio | 20 | 1.03 | [0.57, 1.46] | 0 | 0.87 | [0.05, 2.07] | 276/8000 |
Cov. ratio | 40 | 0.90 | [0.34, 1.49] | 334/8000 | 0.79 | [0.016, 2.88] | 332/8000 |