One of the most challenging problems for national statistical agencies is how to release to the public microdata sets with a large number of attributes while keeping the disclosure risk of sensitive information about data subjects under control. When statistical agencies alter microdata in order to limit the disclosure risk, they need to take into account relationships between the variables to produce a good-quality public data set. Hence, Statistical Disclosure Limitation (SDL) methods should not be univariate (treating each variable independently of the others), but preferably multivariate, that is, handling several variables at the same time. Statistical agencies are often concerned about the disclosure risk associated with extreme values of numerical variables. Thus, such observations are often top- or bottom-coded in public use files. Top-coding consists of substituting a threshold, for example the 99th percentile of the corresponding variable, for all observations above it. Bottom-coding is defined similarly but applies to the values in the lower tail of the distribution. We argue that a univariate form of top/bottom-coding may not offer adequate protection for some subpopulations that differ, in terms of a top-coded variable, from other subpopulations or from the population as a whole. In this paper, we propose a multivariate form of top-coding based on clustering the variables into groups according to some metric of closeness between the variables and then forming the rules for the multivariate top-codes using techniques of Association Rule Mining within the clusters of variables obtained in the previous step. Bottom-coding procedures can be defined in a similar way. We illustrate our method on a genuine multivariate data set of realistic size.

Many national surveys conducted by government agencies have a large number of attributes of different types. Some examples of such surveys in the USA are the National Health Interview Survey [

Records that have extreme or very large values of numerical attributes are often a source of concern about the disclosure risk associated with these values. One way of addressing such a risk is to top-code numerical attributes that are considered “visible” or possibly known from other publicly available data sources and that are not subject to frequent variation. For example, a person’s height can be top-coded to 75 inches, so all the individuals who are taller than 75 inches are recorded in the category “75 inches and above”. Such top-coding thresholds are chosen by the data protectors. Typically these thresholds are estimates of upper percentiles of the corresponding variable, for example, the 95th, 97th, or 99th percentiles.

However, when top-coding thresholds are determined independently of other variables, protection may be inadequate for some groups of individuals. For example, assume the attribute weight is top-coded to 300 pounds for all the respondents. However, a female respondent of Asian race/ethnicity with such a top-coded weight may be far more extreme than a respondent with the same weight who is a white male [

The main contribution of the paper is a new multivariate top-coding procedure which is based on clustering variables and using techniques of Association Rule Mining (ARM) [

Assume there is a microdata set with variables {V_{1}, ⋯, V_{k}} and let V_{i} ∈ {V_{1}, ⋯, V_{k}} be a variable that is selected for top-coding. Next, we perform the search for the sub-populations that should have special top-codes for V_{i} within the vertical partition corresponding to the cluster of variables around V_{i}.

In [_{i} and _{j}.

In [

A better way to group the variables for multivariate top-coding is to include in each cluster only the variables closest to V_{i}, that is, those whose distance from V_{i} does not exceed a chosen cut-off value. Once the clusters are formed for each V_{i} ∈ {V_{1}, ⋯, V_{k}}, the search for sub-populations that require special top-codes for each V_{i} is done within the corresponding cluster. To accomplish this search we propose to use Association Rule Mining (ARM), a popular rule-based machine learning methodology for discovering interesting relationships between variables. There are several reasons why we decided to use ARM. First, the problem of multivariate top-coding, as we outlined it above, can be expressed as a search for association rules involving the variables V_{i}. An association rule is an implication between conjunctions of conditions on variables V_{i}, V_{j}, ⋯, V_{f} from the data set, where for continuous variables the conditions are specific intervals [l_{f}, u_{f}] within the domains of the corresponding variables. In the paper we call the antecedent of the rule the left-hand side (LHS) and the consequent the right-hand side (RHS).

The association rules that we are proposing for multivariate top-coding are of the form:

For example, (

Another reason for using ARM is that these techniques are designed to work well for large databases. ARM algorithms are implemented in many software packages, including R.

The support of an association rule is defined as the proportion of the records in the data set that satisfy both the LHS and the RHS of the rule.

The confidence of a rule is defined as the proportion of the records satisfying the LHS of the rule that also satisfy its RHS.
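These standard definitions can be written down directly. In the toy sketch below, representing the LHS and RHS as boolean predicates on a record is our own choice:

```python
# Support and confidence of a rule LHS => RHS over a list of records.
# `lhs` and `rhs` are boolean predicates on a record (our representation).
def support(records, condition):
    """Proportion of all records satisfying `condition`."""
    return sum(condition(r) for r in records) / len(records)

def confidence(records, lhs, rhs):
    """Among records satisfying the LHS, the proportion also satisfying the RHS."""
    matching = [r for r in records if lhs(r)]
    return sum(rhs(r) for r in matching) / len(matching)

records = [
    {"sex": "F", "weight": 310},
    {"sex": "F", "weight": 200},
    {"sex": "M", "weight": 310},
    {"sex": "F", "weight": 320},
]
is_f = lambda r: r["sex"] == "F"
heavy = lambda r: r["weight"] > 300
print(support(records, lambda r: is_f(r) and heavy(r)))  # 0.5
print(confidence(records, is_f, heavy))                  # 2 of the 3 female records
```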

The standard Apriori [

Mining association rules on both categorical and numerical attributes, often called mining quantitative association rules, has been covered significantly less in the literature. There is no method that is considered a “gold standard” for quantitative association rules. The difficulty of mining these rules stems from the fact that numerical attributes are usually defined on a wide range of values. It is not practical to enumerate all possible numeric values, as is done for categorical values, because in most cases there are many such values and each individual value does not appear frequently.

In [

For categorical variables QuantMiner computes frequent itemsets similarly to Apriori; that is, it finds frequently occurring instantiations of the categorical variables.

The algorithm starts with an initial population of rules for each rule template. Different rules in the initial population have different intervals for continuous variables, randomly chosen within their domains. In the following generations, the intervals are subject to change by genetic operators of mutation and crossover [

The fitness function used in QuantMiner is proportional to the Gain of the rule, penalized by the relative sizes of the intervals on the LHS, where the relative size of the interval for a numerical variable V_{f} is the ratio of the interval length to the length of the domain of V_{f}.
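As we understand the formulation, the Gain of a rule A ⇒ C is supp(A ∧ C) − minconf · supp(A), and the fitness discounts it by the relative interval sizes. A hedged sketch follows; the squared penalty and the names are our assumptions, and the exact form in QuantMiner may differ:

```python
# Sketch of a Gain-based fitness in the spirit of QuantMiner.  The
# (1 - proportion)^2 penalty shrinks the fitness of rules whose LHS
# intervals cover most of their domains, favoring short intervals.
def gain(n_lhs_and_rhs, n_lhs, n_total, minconf):
    """Gain = supp(LHS and RHS) - minconf * supp(LHS)."""
    return n_lhs_and_rhs / n_total - minconf * n_lhs / n_total

def fitness(g, interval_props):
    """Discount the gain by (1 - prop)^2 for each LHS interval,
    where prop = interval length / domain length."""
    for prop in interval_props:
        g *= (1 - prop) ** 2
    return g

g = gain(50, 80, 1000, 0.5)   # 0.05 - 0.5 * 0.08, approximately 0.01
print(fitness(g, [0.5]))      # approximately 0.01 * 0.25 = 0.0025
```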

Let p_{i} be the percentile level chosen for top-coding of V_{i}. For example, if p_{i} = 99, the estimate of the 99th percentile of V_{i} serves as a top-code threshold for this variable. For each variable V_{i}, let C_{i} be the cluster of variables that contains V_{i}. We propose the following procedure to determine which sub-populations may need special top-codes (that is, lower than the rest of the population) for each variable V_{i}:

Compute the p_{i}th percentile of V_{i} using all the records in the data set. Denote this marginal percentile as t_{i}.

Mine the following type of association rules on the vertical partition of the data that corresponds to the cluster of variables C_{i}: rules with V_{i} on the RHS, where the RHS interval lies below the marginal percentile t_{i}, so that each rule captures a particular sub-population that should “get” its own top-coding threshold, different from t_{i}.

Choose the rules with sufficiently high confidence and for which the sub-population percentile of V_{i} is at most t_{i} minus a chosen margin.

For each subpopulation defined by the LHS of the rules mined in the previous step, compute the p_{i}th percentile of V_{i} using the records that belong to these subpopulations. The computed percentiles may serve as the top-codes for these subpopulations.
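The steps above can be sketched end to end as a toy illustration. The names are ours, a nearest-rank percentile stands in for whatever estimator an agency would use, and the rule mining of step 2 is assumed to have already produced the subgroup definitions:

```python
# Steps 1, 3 and 4 of the procedure: compare each mined subgroup's
# percentile of the top-coded variable with the marginal percentile and
# keep it as a special (lower) top-code when the gap is large enough.
def percentile(values, p):
    """Nearest-rank p-th percentile (a simple stand-in estimator)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def subgroup_top_codes(records, var, p, subgroups, gap):
    marginal = percentile([r[var] for r in records], p)   # step 1
    codes = {}
    for name, member in subgroups.items():                # subgroups come from step 2
        sub_pct = percentile([r[var] for r in records if member(r)], p)  # step 4
        if sub_pct <= marginal - gap:                     # step 3's selection margin
            codes[name] = sub_pct
    return codes
```

For instance, if the marginal 99th percentile of weight is 300 pounds and a mined subgroup's 99th percentile is only 220, the subgroup would receive 220 as its own top-code.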

To find quantitative association rules (step 2 of the procedure above) we used a modified QuantMiner procedure: we changed the way the interval boundaries of numerical variables appearing on the LHS of the rules are calculated, and we also changed the form of the fitness function. Regarding the calculation of interval boundaries, in the original version of QuantMiner both ends of the intervals are subject to change by the operators of crossover and mutation, and the shortest intervals are sought. We instead fixed the lower end of the interval at the minimal value of the domain for those numerical variables that appear on the LHS of the rule and are positively correlated with the top-coded variable V_{i}. If the numerical variable on the LHS of the rule is negatively correlated with V_{i}, then the lower end of the interval is subject to change and the upper end is fixed. This is done so as not to exclude from protection by top-coding individuals with values of numerical variables close to the boundaries of the domain who should otherwise be protected. For example, assume the variable
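The boundary convention described in this paragraph can be made concrete as follows (a sketch with our own names; in the genetic search only the “free” end of each interval would then be mutated or crossed over):

```python
# Which end of an LHS interval is pinned depends on the sign of the
# correlation with the top-coded variable: positive correlation pins the
# lower end at the domain minimum (so low values stay inside the rule);
# negative correlation pins the upper end at the domain maximum.
def lhs_interval(domain_min, domain_max, corr_sign, free_end):
    if corr_sign >= 0:
        return (domain_min, free_end)  # only the upper end evolves
    return (free_end, domain_max)      # only the lower end evolves

print(lhs_interval(0, 500, +1, 120))  # (0, 120)
print(lhs_interval(0, 500, -1, 120))  # (120, 500)
```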

As mentioned above, we also modified the form of the fitness function. Contrary to the original QuantMiner, we seek the largest intervals for the numerical variables V_{f} on the LHS of the rule, subject to the resulting rule satisfying minimal confidence and minimal support.

The reason for this modification is, again, not to exclude from protection any individuals who otherwise should be protected. Indeed, larger intervals typically correspond to larger groups of individuals having values of numerical variables within these intervals. Thus, the largest intervals on the LHS of the rule in our algorithm define the largest sub-population that may need lower top-codes for V_{i} than the top-codes for the rest of the population.

Finally, it is important to note that the procedure of bottom-coding is a straightforward conversion of the top-coding procedure described above.

We applied our approach of multivariate top-coding to a genuine multivariate data set that was downloaded from the UCI Machine Learning Repository [

To illustrate our approach we chose the variables

The cluster of variables around

The default minimal support of the rules in QuantMiner is set to 10%, but in our experiments we lowered the minimal support to 1% in order to be able to identify small sub-populations (of the size of 1% of the data set or larger) that may require their own top-codes. For a data set of this size, this means that these sub-populations must contain at least 18,000 records. The main constraint on lowering the support of the rules is the computational burden, because many more subpopulations need to be checked and, as a consequence, many more potential rules must be tested by the algorithm.

It should be noted that the main purpose of the proposed procedure is to assist the data protector in the otherwise daunting task of going through the large number of possible combinations of the relevant attributes in a big data set in order to find rarely observed extreme observations of top-coded variables for certain groups of records or sub-populations. These sub-populations are usually associated with lower values of the numerical variables subject to top-coding. Our rules are meant to bring such special cases to the data protector’s attention. However, the decision about whether or not to use these rules to apply top-codes depends on many factors, such as the particular scenario of data release, SDL practice at a particular institution, and the preferences of data protectors. In any case, such decisions are usually made together with subject area specialists. Furthermore, some of the rules may be obvious, or they may always be observed in the data; for example, the rules that have confidence equal to 100%. Thus, not every automatically mined rule should imply top-coding. In some instances, the rules that have confidence equal to 100% may instead be used to check for and find incorrectly recorded observations or implausible values.

Due to space limitation, below we present a selection of rules for

Some of the rules presented above seem intuitive or common sense. Examples of such rules are those that have income on the RHS and

Another example of rules that are intuitive are the rules that involve

Rules that include a combination of the following three characteristics: marital status = “Never married” combined with zero or small values of social security income in the previous year (variable

Another characteristic that is related to income is the occupation of the respondent (

As expected, rules that included the variable

We conclude this section by emphasizing that the focus of the paper is not the discussion and analysis of particular rules, but the development and description of the methodology to obtain such rules. Deeper analysis of the rules obtained by our procedure should be done by the data protector and subject area specialist for each particular data set and the scenario of data release.

In this paper we propose a new approach to multivariate top-coding for disclosure limitation in large databases with many attributes of different types. We outline an automated procedure that can help the data protector find subpopulations that may need their own top-codes, lower than those of the rest of the population. Such a procedure may be used as an aid for data collecting organizations in the disclosure review process as an alternative, or in addition, to their regular procedures. Such procedures often involve identification of risky combinations of the variables, which is typically based on intuition as well as knowledge of a particular data set. In big data sets these procedures may be complicated and computationally involved, as they require computing many tabulations to identify potentially rare or risky combinations of the categories of these attributes. Thus, an automated procedure to identify such cases can be helpful, especially when the data protector intends to release data sets with many attributes of different types, such as big government surveys.

To reduce the complexity of the problem we outlined a two-step approach: first, clustering the variables around the top-coded variables using squared canonical correlations; then, running our association rule mining algorithm on a vertical partition of the data that consists of the variables in the same cluster as the top-coded variables. This two-step approach makes association rule mining and the subsequent work with the rules by subject area specialists computationally feasible.

We would like to note that the association rules found by the proposed approach are meant to bring to the data protector’s attention particular combinations of the attributes that are rarely associated with the extreme values of the numerical variable that is subject to protection. Data protectors can choose top-coding or some other technique for protection of these groups of individuals. For example, synthesis can be used to impute safer values of numerical attributes.

Our future work consists of finding efficient ways to further reduce the number of association rules. Another direction of future research is to investigate possible ways of incorporating the data protector’s preferences and knowledge into the algorithm. For example, certain individual characteristics are more visible or noticeable than others; for instance, amputations/missing limbs, walking aids, and some others. We will therefore investigate the best way of weighting the variables/characteristics in both the clustering step and the association rule mining algorithm.

The findings and conclusions in this paper are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. The first author would like to thank Ellen Galantucci from the Bureau of Labor Statistics for the helpful discussion on the content of