Multi-sample U-statistics encompass a wide class of test statistics that allow the comparison of two or more distributions. U-statistics are especially powerful because they can be applied to both numeric and non-numeric data, e.g., ordinal and categorical data where a pairwise similarity or distance-like measure between categories is available. However, when comparing the distribution of a variable across two or more groups, observed differences may be due to confounding covariates. For example, in a case-control study, the distribution of exposure in cases may differ from that in controls entirely because of variables that are related to both exposure and case status and are distributed differently among case and control participants. We propose to use individually-reweighted data (i.e., using the stratification score for retrospective data or the propensity score for prospective data) to construct adjusted U-statistics that can test the equality of distributions across two (or more) groups in the presence of confounding covariates. Asymptotic normality of our adjusted

U-statistics^{1,2} are widely used to compare the distribution of a random variable of interest across two or more groups. An appealing feature of U-statistics is that they often rely only on symmetry. For example, given that the distributions of a variable across two or more groups are the same, if the data are pooled and then ranked, we would expect that the average rank of observations from each group should be the same; this forms the basis of the Wilcoxon rank sum test. U-statistics are very general, and can be used for non-numeric data, e.g., ordinal and categorical data where a pairwise similarity or distance-like measure between categories is available. For example, in

When group membership is randomly assigned, we are certain that any difference we observe between the groups must be due to differential treatment of the groups after randomization. For example, after randomization, one group may be given an active drug and another group a placebo. In this case, differences in medical outcome can be attributed to the effect of the drug. However, when group membership is not assigned through randomization, there may be confounding covariates that can cause a spurious association between outcome and group membership. Specifically, if there are covariates that influence both group membership and the outcome variable we are comparing, then an observed difference in the distribution of outcome variable across groups may be due to a difference in the distribution of these covariates across groups. For example, in a study that compares lung capacity among persons who consume alcohol and persons who abstain from alcohol, an observed difference may be due to the presence of more smokers among the persons who consume alcohol. In the haplotype example, the genetic ancestry of cases may differ systematically from controls. In the data from African-Americans we consider, the proportions of African or European ancestry will affect the distribution of haplotypes found in each group. If African-Americans have a different risk of disease than persons of purely European ancestry, then genetic ancestry is a confounder and must be accounted for in the analysis.^{3,4}

The usual approach to account for confounding covariates is to model their effect on the outcome of interest using a regression approach. While this direct approach is very useful, it requires a test that can be formulated in a regression setting. Unfortunately, U-statistics are typically not related to regression procedures. As a result, direct adjustment is problematic. For example, it is unclear how the direct approach could be applied in the haplotype example, where the outcome (similarity or sharing) is only defined for pairs of haplotypes. A related approach is to regress the outcome on covariates and then form a U-statistic from the residuals of this regression. This approach is only valid for linear regression, and is limited to the situation where the outcome variable is numeric.

Here we take an alternative approach based on the stratification score^{5,6} or the propensity score.^{7,8} We model the probability of group membership as a function of confounding covariates, typically using logistic regression for two groups or polytomous logistic regression for more than two groups. Then, we inversely weight the sample according to the probability of group membership.^{9} Under the null hypothesis of no group effect on the outcome, the weighted outcome distributions should be the same.^{5,10} We then construct adjusted U-statistic tests based on these weighted distributions. Since reweighted sample means and estimating equations based on propensity scores have been used in the context of causal inference, it can be anticipated that a similar approach may work for a U-statistic. Although inverse weighting to account for confounding has been available for many years, it has not been well established in the nonparametric literature apart from the work of Jiang et al.^{11,12} and Rosenbaum.^{13} Jiang et al.^{11,12} proposed a propensity-score-adjusted generalization of Kendall’s tau for estimating the association between genotype and trait in a single population; Rosenbaum^{13} proposed a new family of U-statistics for comparing matched pairs. However, a formal treatment of such U-statistics for a general kernel does not seem to be available. We demonstrate, both theoretically and through simulations, that this approach works not only for propensity scores but also for stratification scores in retrospective studies. More importantly, we obtain a closed-form estimator of the asymptotic variance of our reweighted U-statistics, which is a novel and useful contribution of the current work. We also generalize our proposed U-statistics to compare multiple groups. Because this variance estimator is somewhat complex, we have made R code available that implements our approach for two- and multi-sample tests.
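To make the reweighting idea concrete, the following Python sketch fits a logistic model for group membership, forms inverse-probability-of-group-membership weights, and computes a weighted Wilcoxon-type (Mann-Whitney) statistic. It is an illustration under stated assumptions and not our R implementation: the function names `fit_logistic` and `weighted_mann_whitney`, the simulated data, and the specific weight 1/P(observed group | covariates) are hypothetical choices made for this example, and the closed-form variance estimator is not included.

```python
import numpy as np

def fit_logistic(X, g, n_iter=25):
    """Newton-Raphson fit of P(g = 1 | X); X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (g - p)                       # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X  # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

def weighted_mann_whitney(y, g, w):
    """Weighted estimate of P(Y0 < Y1) + 0.5 * P(Y0 = Y1) for groups 0 and 1."""
    y0, w0 = y[g == 0], w[g == 0]
    y1, w1 = y[g == 1], w[g == 1]
    # Pairwise Wilcoxon kernel: 1 if y0 < y1, 0.5 for ties, 0 otherwise.
    kernel = (y0[:, None] < y1[None, :]) + 0.5 * (y0[:, None] == y1[None, :])
    return (w0 @ kernel @ w1) / (w0.sum() * w1.sum())

# Hypothetical data: x confounds group and outcome; there is no group effect.
rng = np.random.default_rng(0)
n = 3000
x = rng.normal(size=n)
g = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))  # group depends on x
y = x + rng.normal(size=n)                     # outcome depends on x only

X = np.column_stack([np.ones(n), x])
p_hat = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, g)))
# Assumed weights: inverse of the fitted probability of the observed group.
w = np.where(g == 1, 1.0 / p_hat, 1.0 / (1.0 - p_hat))

naive = weighted_mann_whitney(y, g, np.ones(n))  # unweighted statistic
adjusted = weighted_mann_whitney(y, g, w)        # reweighted statistic
print(f"naive = {naive:.3f}, adjusted = {adjusted:.3f}")
```

In this simulated setting the unweighted statistic should sit well away from its null value of 0.5 purely because of confounding, while the reweighted version should be close to 0.5.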

The rest of the paper is organized as follows. In

To develop adjusted U-statistics that account for confounding covariates, we adopt a marginal approach that standardizes the data by weighting observations so that the distribution of confounding covariates is the same in each group.^{5,15} Assume that the _{i}, and let _{i} denote the outcome variables with realization _{i}. We let _{g} denote the total number of observations from group _{1},_{2},⋯, _{B}), _{1},_{2},⋯, _{D}), _{1} is the number of observations having _{2} is the number of observations having _{1} < _{2} is _{1} has the same distribution as _{2}. In the presence of confounding covariates

To develop such a test, consider comparing the CDF of

Allowing ^{14}) for consistency and asymptotic normality of the maximum likelihood estimator of _{i} (^{18} Once

Assume that we fit model

Returning to the problem of estimating the distribution of _{i}_{i};

Choosing ^{5} thus data in group 1 are not reweighted in

Although

Finally, under the null hypothesis that the distribution of _{1} is the number of observations for which _{2} is the number of observations for which

Motivated by the standardized (weighted) CDF estimators just described, we propose the following two-sample adjusted

Comparing _{a} is closely related to the standard

We assume that the second moment of this adjusted kernel is finite. We show in the appendix that it is possible to develop a linear approximation of _{a} using the projection approach. In particular, we show that _{a} has an asymptotically normal distribution, and that the asymptotic variance of _{a} is consistently estimated by _{g} is the number of observations having _{i}

Here

In the above equations, _{n} are defined in equations (_{a} −

To account for comparisons involving more than two groups, we consider a vector of two-sample U-statistics ^{th} component of

Because each component _{i} (now a vector) calculated among those observations having _{i}
_{a} has a ^{2} distribution with degrees of freedom given by the rank of the matrix

If the alternative hypothesis is that groups are ordered in their response, then we choose ^{17} we choose

Asymptotically, _{a} has a standard normal distribution. If the direction of the ordering is known a priori, a one-sided p-value may be used.

To demonstrate the general properties of our test, we used data on three groups simulated using the model _{1}, _{2}, _{3}, _{4}), where _{1} is the intercept, _{2} ~ _{2} = (−2, 0.15, 0.2, 0.1) and _{3} = (−2.5, 0.3, 0.4, 0.2). We chose

To confirm the asymptotic normality of our adjusted U-statistics, we generated 1,000 data prospectively from model (^{19} investigated the effects of misspecification of the propensity score on estimators of treatment effect, and conclude that the bias of the estimator of the treatment effect is large if the covariates are omitted. The naive approach (i.e., the unadjusted U-Statistics) can be considered as a special case of the propensity score models with all covariates omitted. The results presented in ^{19} indicating that the naive approach has a large bias for assessing the treatment effect.

Next, to investigate power, we considered simulating from the model
_{2}, _{3}) = _{2}, _{3}) = (0,0) calculated using the R package VGAM.^{18} The results are given in _{2}, _{3}) = (0,0) for the same model as specified in equation (

To examine the power of the Wald test and the U-statistic-based tests when the regression model is misspecified, we carried out exactly the same simulations as above, except that the simulated data were generated from the following model:

As before, the Wald test is based on the regression model specified in (

To illustrate the wide variety of analyses that can be done with the adjusted U-statistics we describe here, we analyze data on the association between genetic haplotypes and the risk of schizophrenia in African-Americans. Haplotypes (i.e., the adjacent alleles that were contributed by the same parent, e.g., the adjacent paternally-derived alleles) in the catechol-O-methyltransferase (COMT) gene have been associated with schizophrenia in an Ashkenazi population,^{20} and deletions of the region containing COMT cause velocardiofacial syndrome, a syndrome that is associated with a high rate of schizophrenia.^{21} Here we test the hypothesis that haplotypes of COMT are associated with schizophrenia using data from the GAIN network study of Schizophrenia, a genome-wide association study with data from 885 African-American case participants and 830 African-American control participants.

Because genotypes, not haplotypes, are observed, it is necessary to exercise some care when making inference about haplotypes. Here we avoid these issues by comparing the similarity between the haplotypes of a case and a control participant to the similarity of haplotypes between two control participants. While it would seem that the unobserved haplotypes are required to measure this similarity, Tzeng et al.^{22} showed that the “counting measure” which compares the number of alleles that the haplotypes in one person share in common with the haplotypes of another person, can be calculated using genotype data alone. The similarity Δ_{ij} between the

For this analysis, we define the region of interest when calculating similarity to be the 15 SNPs that are genotyped in these data and lie between rs737865 and rs165599 inclusive (the region identified by Shifman et al.^{22}). Our null hypothesis is that the distribution of haplotypes among case participants is the same as that among control participants; the alternative is that case participants have a different haplotype distribution, implying that COMT haplotypes are risk factors for schizophrenia.

Unlike the Ashkenazi population, African-Americans are genetically heterogeneous, and individuals vary in their proportion of African and European ancestry. Ancestry is a confounder because it affects both haplotype frequencies and the risk of disease. While ancestry is typically unmeasured, it is well established^{3,4} that principal components of genotype data can be used to control for confounding by ancestry. The details of the calculation of these confounding covariates in the GAIN schizophrenia study are described in Allen and Satten,^{5} who concluded that 3 principal components were sufficient to control for confounding by population stratification. Thus, we adjust for ancestry using principal components as confounding covariates when calculating the U-statistic just described.
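As a rough illustration of how principal components can act as ancestry covariates, the sketch below extracts the leading components from a simulated genotype matrix. It does not reproduce the calculation of Allen and Satten^{5}; the function name `top_principal_components`, the two-population admixture model, and all simulated allele frequencies are assumptions made only for this example.

```python
import numpy as np

def top_principal_components(genotypes, k=3):
    """Return the first k principal-component scores of an (n x m) genotype matrix."""
    centered = genotypes - genotypes.mean(axis=0)  # center each SNP
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]                        # scores, one row per person

# Hypothetical admixed sample: each person's allele frequencies are a mix of
# two ancestral populations, with a person-specific ancestry proportion.
rng = np.random.default_rng(1)
n, m = 300, 200                                    # people, SNPs
freq_a = rng.uniform(0.1, 0.9, size=m)
freq_b = rng.uniform(0.1, 0.9, size=m)
ancestry = rng.uniform(size=n)
freq = ancestry[:, None] * freq_a + (1 - ancestry[:, None]) * freq_b
genotypes = rng.binomial(2, freq).astype(float)    # minor-allele counts 0/1/2

pcs = top_principal_components(genotypes, k=3)
corr = np.corrcoef(pcs[:, 0], ancestry)[0, 1]
print(f"|correlation of PC1 with simulated ancestry| = {abs(corr):.2f}")
```

In this toy setting the first component tracks the simulated ancestry proportion closely, which is why such components can be entered as covariates in the model for group membership.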

To confirm the performance of our method with the higher-order kernel (_{g}, and _{g} = 0. In this scheme, the association between

To ensure that the kernel (_{g} increases, cases will be more likely to have larger values of _{g} = 0.5, 1.0 and 1.5 is presented in _{g}. It is clear that when cases and controls differ in their allele-sharing characteristics, the U-statistic based on kernel (

Using the GAIN data, we tested the association of COMT haplotypes and case status using the kernel described above. Standardizing to the study population, we obtained a test statistic of 0.996, corresponding to a p-value of 0.318. These results suggest that COMT haplotypes are not associated with schizophrenia in the GAIN study.
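The reported p-value can be checked with a short calculation. The sketch below assumes that the statistic 0.996 is referred to a chi-square distribution with 1 degree of freedom (equivalently, that it is the square of a standard normal statistic); this reading is an assumption made here for illustration, and the helper name `chi2_1df_pvalue` is hypothetical.

```python
import math

def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 df, P(chi2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

p_value = chi2_1df_pvalue(0.996)
print(round(p_value, 3))  # prints 0.318
```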

U-statistics are a powerful tool for statistical analysis of a variety of data types. However, standard U-statistics that compare samples from two or more populations do not allow for differences in confounding covariates between these populations. Using stratification- or propensity-score-based weights, we have introduced adjusted U-statistics that account for confounding covariates. Using simulated data, we have shown that our adjusted U-statistics have appropriate size when the only association is spurious (due to confounding covariates) and maintain good efficiency relative to a properly specified parametric model when a true association is present. We have also developed a closed-form variance estimate for the adjusted U-statistics and provided R code for implementing our procedure. Finally, we have demonstrated the use of our adjusted U-statistics using genotype data, testing for genetic association between haplotypes in the COMT gene and schizophrenia in an African-American population in which adjustment for confounding by the proportion of African and European ancestry is required for valid inference.

Although a few studies on adjusted U-statistics have appeared in the literature,^{11,12,13} there are some fundamental differences between our approach and theirs in context, scope, and basic approach. Jiang et al.^{11,12} deal only with the question of estimating the association between genotype and trait in a single population; their starting point is a one-sample U-statistic that has the special form of a product of a kernel involving only trait information and a kernel involving only genotype information. Because the genotype kernel is linear in genotype ^{11,12} consider only testing a correlation between two variables in a single population, their approach does not generalize to popular two- or multi-sample U-statistics such as the Wilcoxon test we considered here. Rosenbaum^{13} proposes rank-based U-statistics for matched pairs, built from a set of one-sample quantities (the differences between the case value and control value for each matched set). Thus, in effect, it is a one-sample problem. It may be possible to use a permutation approach with our test for small samples in certain situations (e.g., when the confounding model is correctly specified). One needs to ensure that the amount of confounding in each permuted dataset remains the same. Further, the weighting model would have to be re-fit for each permutation (see, e.g., Epstein et al.^{23}).

DISCLAIMER

The conclusions, findings, and opinions expressed by the authors do not necessarily reflect the official position of the U.S. Department of Health and Human Services, the Public Health Service or the Centers for Disease Control and Prevention.

SOFTWARE AVAILABILITY

R code for computing the standardized test statistic is available as “SUPPORTING WEB MATERIAL” to this manuscript.

We derive here an iid representation of _{a} that facilitates calculation of its asymptotic distribution. We use simplified notations whenever possible: _{1}_{2}(_{1,W}]^{B}[_{2,W}]^{D}, where ^{c} denotes its mean corrected version _{p}^{−1/2}) terms will be denoted by ≈, where _{1}+ _{2}.

Suppose the group membership is related to covariate by the logistic regression model ^{14}) for consistency and asymptotic normality of the maximum likelihood estimator of _{1}(_{2}(^{2} < ∞.

By using a first-order Taylor series expansion, write

where

Finally, combining _{n} = −(_{1}_{4}
_{2}_{5}
_{3}_{6}). Hence _{a} follows an asymptotically normal distribution with mean

_{1}(_{2}(_{a}, say

In the case that the weight is chosen to standardize to group 1 as defined in

First, consider a weighted mean
^{5} when ^{7} when

Note that only dependence on _{w}(

Thus, after normalizing, we see that inverse-probability-of-group-membership weighting gives the same distribution of covariates

We can now immediately extend this argument to kernels of order (1,1). First write
_{s} denotes the probability law that applies to sample _{w}(

Comparison of empirical p-values and theoretical (uniform) p-values for the Kruskal-Wallis type test (Panel A) and Jonckheere-Terpstra type test (Panel B). Brown (long-dashed curve) corresponds to standardization to the study population, blue (dotted curve) to standardization to group 1, red (solid curve) to the parametric model, and black (dash-dotted curve) to the naive U-statistic that does not account for confounding.

Power of the adjusted U-statistic for the Kruskal-Wallis type test (Panel A) and Jonckheere-Terpstra type test (Panel B) when the parametric model is correctly specified, and the power of the adjusted U-statistic for the Kruskal-Wallis type test (Panel C) and Jonckheere-Terpstra type test (Panel D) when the parametric model is mis-specified. Solid curve is the Wald test for the parametric model. Long-dashed and dotted curves are adjusted U-statistics that standardize to the study population and group 1, respectively. The parameter

Expected vs. empirical p-values under null hypothesis using Kernel

Empirical size from 1,000 simulated data sets for tests having a nominal size of 5%.

Analysis | Standardization | Empirical size |
---|---|---|
Kruskal-Wallis test (U-statistic) | None | 0.137 |
Kruskal-Wallis test (Adjusted U-statistic) | Study Population | 0.053 |
Kruskal-Wallis test (Adjusted U-statistic) | Group 1 | 0.047 |
Jonckheere-Terpstra test (U-statistic) | None | 0.155 |
Jonckheere-Terpstra test (Adjusted U-statistic) | Study Population | 0.054 |
Jonckheere-Terpstra test (Adjusted U-statistic) | Group 1 | 0.057 |
Wald Test, Logistic Regression | Not applicable | 0.038 |
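When reading the empirical sizes above, it helps to know how much Monte Carlo error to expect from 1,000 simulated data sets. The following sketch computes an approximate 95% band around the nominal 5% level using the usual binomial standard error; the normal approximation and the helper name `mc_interval` are choices made for this illustration rather than part of the reported analysis.

```python
import math

def mc_interval(p, n_sim, z=1.96):
    """Approximate 95% Monte Carlo band for an empirical rejection rate."""
    se = math.sqrt(p * (1 - p) / n_sim)  # binomial standard error
    return p - z * se, p + z * se

lo, hi = mc_interval(0.05, 1000)
print(f"band: ({lo:.3f}, {hi:.3f})")

# Empirical sizes copied from the table above.
sizes = {
    "Kruskal-Wallis, none": 0.137,
    "Kruskal-Wallis, study population": 0.053,
    "Kruskal-Wallis, group 1": 0.047,
    "Jonckheere-Terpstra, none": 0.155,
    "Jonckheere-Terpstra, study population": 0.054,
    "Jonckheere-Terpstra, group 1": 0.057,
    "Wald test, logistic regression": 0.038,
}
within = {name: lo <= size <= hi for name, size in sizes.items()}
for name, ok in within.items():
    print(f"{name}: {'within' if ok else 'outside'} the band")
```

Under this approximation the band is roughly 0.036 to 0.064, so the adjusted tests and the Wald test are compatible with the nominal level while the two unadjusted U-statistics are not.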

Size and power for simulated data sets for the genetic example.

_{g} | Power |
---|---|
0.0 | 0.055 |
0.5 | 0.146 |
1.0 | 0.806 |
1.5 | 0.976 |