This is an Open Access article distributed under the terms of the Creative Commons Attribution License (

Because common diseases are caused by complex interactions among many genetic variants along with environmental risk factors, very large sample sizes are usually needed to detect such effects in case-control studies. Nevertheless, many genetic variants act in well-defined biologic systems or metabolic pathways. Therefore, a reasonable first step may be to detect the effect of a group of genetic variants before assessing specific variants.

We present a simple method for determining approximate sample sizes required to detect the average joint effect of a group of genetic variants in a case-control study for multiplicative models.

For a range of reasonable numbers of genetic variants, the sample size requirements for the test statistic proposed here are generally no larger than those needed to assess the marginal effects of individual variants and, in many of the situations considered, actually decline as the number of genetic variants in the group increases.

When a significant effect of the group of genetic variants is detected, subsequent multiple tests could be conducted to detect which individual genetic variants and their combinations are associated with disease risk. When testing for an effect size in a group of genetic variants, one can use our global test described in this paper, because the sample size required to detect an effect size in the group is comparatively small. Our method could be viewed as a screening tool for assessing groups of genetic variants involved in pathogenesis and etiology of common complex human diseases.

With the completion of the Human Genome Project and continuing advances in gene mapping and sequencing [

There have been several suggested methodologies to reduce the complex interactions of genetic and environmental effects, most notably multi-dimensionality reduction techniques, or MDR [

In this paper, we present a simple method for assessing the overall effect of a group of genetic variants in the context of case-control studies. Although

Mckeown-Eyssen and Thomas [

Suppose that the population at risk is exposed to a level X_{i} of the i^{th} genetic variant (X_{i} takes the value 1 or 0 according to the presence or absence of the i^{th} genetic variant). Let G_{1}, G_{2},..., G_{k} and R_{1}, R_{2},..., R_{k} be the prevalences and the relative risks for the k genetic variants, which are assumed to be known. Also, let U_{1}, U_{2},..., U_{k} denote the exposure variables (U_{i} can assume only 0 or 1) among cases for the k genetic variants. Let

H_{0}: R_{1} = R_{2} = ... = R_{k} = 1 versus H_{1}: at least one R_{i} ≠ 1.

For large sample sizes, the simultaneous test for difference in prevalence between cases and controls is:

where _{1}, using a conservative simplification due to Lachin [

and

G = (G_{1}, G_{2},..., G_{k}) and G* = (G_{1}*, G_{2}*,..., G_{k}*) are the vectors of prevalence of the k genetic variants for controls and cases, respectively. If the test is required to have a specified power (1-β), δ is calculated as the solution to the equation

If the null hypothesis, H_{0}, is rejected, one can conduct subsequent multiple tests to detect which R_{i}s are significantly different from 1 or test subsets of R_{i}s using the same test statistic given above. However, the level of significance of each test has to be adjusted based on the number of multiple tests conducted.
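The displayed equations for the test statistic and the sample size did not survive in this copy. The LaTeX below is a reconstruction under stated assumptions, not the authors' typeset formulas: the pooled prevalence \bar{G}_i and the estimate \hat{p}_i are symbols introduced here, and the use of a pooled variance is inferred from the single-variant sample sizes in Table 2, which this form appears to reproduce to within rounding when α = 0.05 and power = 0.80 (both values are themselves assumed).

```latex
% Assumed reconstruction; \hat{p}_i and \bar{G}_i are notation introduced here.
\begin{align*}
T &= n \sum_{i=1}^{k} \frac{(\bar{U}_i - \bar{X}_i)^2}{2\,\hat{p}_i (1 - \hat{p}_i)},
  \qquad \hat{p}_i = \tfrac{1}{2}(\bar{U}_i + \bar{X}_i),
  \qquad T \sim \chi^2_k \text{ under } H_0, \\
\delta &= n \sum_{i=1}^{k} \frac{(G_i^{*} - G_i)^2}{2\,\bar{G}_i (1 - \bar{G}_i)},
  \qquad \bar{G}_i = \tfrac{1}{2}(G_i + G_i^{*}), \\
n &= \frac{\delta}{\sum_{i=1}^{k} (G_i^{*} - G_i)^2 \big/ \{2\,\bar{G}_i (1 - \bar{G}_i)\}}, \\
1 - \beta &= \Pr\!\left\{ \chi^2_k(\delta) > \chi^2_{k,\,1-\alpha} \right\}.
\end{align*}
```

Here n is the number of cases (equal to the number of controls), \chi^2_k(\delta) is a non-central chi-squared variable with k degrees of freedom, and \chi^2_{k,1-\alpha} is the central critical value. The last line is the equation that δ solves for a given significance level and power.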

We calculated the sample size required to detect a hypothetical group of k identical genetic variants (all loci are equivalent having equal effects and are independent). Figures

Overall, the sample size requirement declined with increasing values of k. For example, compared with the sample size requirement for k = 1, the sample size requirement for k = 10 declined by approximately 79% on average for all prevalences and risk ratios studied. Prevalences of 0.9 and 0.1 corresponded to the largest sample sizes for all the risk ratios and numbers of genetic variants in the group. There was little difference between sample size requirements for prevalences between 0.3 and 0.6 for large values of k for the given risk ratios. When k is greater than 4 and R = 2.0, the difference in required sample size over the range of prevalence from 0.3 to 0.6 was fewer than 6 observations. Consistent with this result, the surfaces shown in all three figures have a relatively flat bottom for k greater than 4 and for the range of prevalence from 0.3 to 0.6. As expected, the sample size requirement declined with increasing R. A theoretical explanation of these results is given below.

Let G be the prevalence in the population of the genetic variants in the hypothetical group of k identical genetic variants and G* be the prevalence in cases. We assume independent genetic variants. The denominator in (1) is then given by

Let n_{k} be the sample size requirement corresponding to the group of k genetic variants. Then from (1) and (2),

where δ_{k} is the non-centrality parameter of a chi-squared distribution with k degrees of freedom. This result shows that for k = 10,

and for any given G and R in the hypothetical group of k identical variants, the sample size requirement for k = 10 declined by 79.3% compared to the sample size requirement for k = 1. In a similar manner, one can show that

The difference between δ_{k+1} and δ_{k} declines with increasing k and
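The equations of this argument are missing from this copy. A hedged sketch of how it can be reconstructed is given below. It assumes the sample size has the form n = δ / (denominator), that the denominator for k identical independent variants is k times the single-variant term, and that α = 0.05 and power = 0.80 (the values of δ quoted are for that choice and are an assumption, since the design values are not stated in the surviving text). The symbol \bar{G} = (G + G*)/2 is introduced here.

```latex
% Assumed reconstruction for k identical, independent variants.
\begin{align*}
\text{denominator of (1)} &= \frac{k\,(G^{*} - G)^2}{2\,\bar{G}(1 - \bar{G})}, \\
n_k &= \frac{2\,\bar{G}(1 - \bar{G})\,\delta_k}{k\,(G^{*} - G)^2}, \\
\frac{n_k}{n_1} &= \frac{\delta_k}{k\,\delta_1}, \qquad
\frac{n_{10}}{n_1} \approx \frac{16.24}{10 \times 7.85} \approx 0.207, \\
\frac{n_{k+1}}{n_k} &= \frac{k\,\delta_{k+1}}{(k+1)\,\delta_k} < 1
  \iff \delta_{k+1} - \delta_k < \frac{\delta_k}{k}.
\end{align*}
```

The ratio n_k / n_1 does not involve G or R, which is consistent with the statement that the 79.3% decline holds for any given G and R. The last line shows why the size of the increments δ_{k+1} - δ_{k} governs whether adding an identical variant lowers the required sample size.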

Yang [

Table 1. Prevalence and odds ratios of five genetic variants for colorectal cancer susceptibility.

Genetic variants | Risk group | Genotype prevalence (%) | Odds ratio |
(1) | Rare allele vs. others | 4.0 | 2.67 |
(2) GSTT1 | Null vs. others | 37.6 | 1.37 |
(3) | α2 allele vs. others | 39.2 | 2.02 |
(4) | Fast acetylation vs. others | [60.3] | 1.68 |
(5) MTHFR | Wild-type vs. variant (C677T) | 42.3 | 1.35 |

Table 2. Sample size requirement to detect a difference in mean exposure between cases and controls for some combinations of the genetic variants given in Table 1, assuming multiplicative risk.

Genetic variants | Sample size | Genetic variants | Sample size |
(1) | 283 | (4)+(5) | 236 |
(2) | 656 | (1)+(2)+(3) | 109 |
(3) | 130 | (1)+(2)+(4) | 158 |
(4) | 265 | (1)+(2)+(5) | 215 |
(5) | 705 | (1)+(3)+(4) | 93 |
(1)+(2) | 243 | (1)+(3)+(5) | 110 |
(1)+(3) | 110 | (2)+(3)+(4) | 107 |
(1)+(4) | 168 | (2)+(3)+(5) | 181 |
(1)+(5) | 248 | (3)+(4)+(5) | 108 |
(2)+(3) | 134 | (1)+(2)+(3)+(4) | 92 |
(2)+(4) | 232 | (1)+(2)+(3)+(5) | 107 |
(2)+(5) | 417 | (2)+(3)+(4)+(5) | 106 |
(3)+(4) | 107 | (1)+(2)+(3)+(4)+(5) | 91 |
(3)+(5) | 135 | | |
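The entries of Table 2 can be approximated from the prevalences and odds ratios in Table 1 with a short script. The following is a minimal sketch under several assumptions, none of which is confirmed by the surviving text: α = 0.05 and power = 0.80, odds ratios used as risk ratios (rare disease), equal numbers of cases and controls, and a pooled-prevalence variance. All function names are hypothetical, and only the Python standard library is used.

```python
import math

def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x) from its power series."""
    if x <= 0.0:
        return 0.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-17 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def chi2_cdf(x, k):
    """Central chi-squared CDF with k degrees of freedom."""
    return reg_lower_gamma(k / 2.0, x / 2.0)

def ncx2_cdf(x, k, delta):
    """Non-central chi-squared CDF as a Poisson(delta/2) mixture of central CDFs."""
    half = delta / 2.0
    weight = math.exp(-half)  # Poisson probability of j = 0
    total = 0.0
    for j in range(2000):
        total += weight * chi2_cdf(x, k + 2 * j)
        weight *= half / (j + 1)
        if j > half and weight < 1e-17:
            break
    return total

def bisect(f, lo, hi, steps=80):
    """Root of an increasing function f on [lo, hi] by bisection."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def noncentrality(k, alpha=0.05, power=0.80):
    """delta for which a k-df chi-squared test at level alpha has the given power."""
    critical = bisect(lambda x: chi2_cdf(x, k) - (1.0 - alpha), 0.0, 200.0)
    return bisect(lambda d: (1.0 - ncx2_cdf(critical, k, d)) - power, 0.0, 200.0)

def case_prevalence(g, r):
    """Prevalence among cases under the multiplicative model."""
    return r * g / (1.0 + (r - 1.0) * g)

def sample_size(variants, alpha=0.05, power=0.80):
    """Cases (= controls) needed for a group of (prevalence, risk ratio) pairs."""
    denom = 0.0
    for g, r in variants:
        g_star = case_prevalence(g, r)
        g_bar = (g + g_star) / 2.0  # pooled prevalence: an assumption of this sketch
        denom += (g_star - g) ** 2 / (2.0 * g_bar * (1.0 - g_bar))
    return noncentrality(len(variants), alpha, power) / denom

# (prevalence, odds ratio) for variants (1) to (5) of Table 1
TABLE1 = {1: (0.040, 2.67), 2: (0.376, 1.37), 3: (0.392, 2.02),
          4: (0.603, 1.68), 5: (0.423, 1.35)}

for group in [(1,), (3,), (1, 3), (2, 5), (1, 2, 3, 4, 5)]:
    n = sample_size([TABLE1[i] for i in group])
    print("+".join(f"({i})" for i in group), round(n, 1))
```

Under these assumptions the printed values should land close to the corresponding Table 2 entries (283, 130, 110, 417 and 91). Small discrepancies of about one subject are to be expected from rounding conventions that are not recoverable from the text.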

GSTT1 and MTHFR have the smallest odds ratios (1.37 and 1.35 respectively) in Table

These results for individual genetic variants seem to carry over to the group of genetic variants. For example, the sample size requirement to detect a group of two genetic variants out of the five given in Table

We have presented a simple method for estimating the sample size for case-control studies required to detect a group of genetic variants using multiplicative models. We have also used the same approach for additive risk models; however, we could not show the asymptotic normality of the joint distribution of exposure for cases (Appendix A2).

In the multiplicative model, when the genetic variants are found to be jointly significant, subsequent multiple tests could be conducted to detect which R_{i}s are significantly different from 1. For example, if the null hypothesis is rejected for a group of five genetic variants, and R_{1}, R_{2} and R_{5} are significantly different from 1, we can conclude that the joint effect of G_{1}, G_{2} and G_{5} is significantly different between cases and controls.

Consider k hypothesis tests. Under the null hypothesis, by the Bonferroni inequality, the probability that at least one of the k tests is significant at level α_{0} is less than or equal to kα_{0}. In order to maintain an overall level of significance α, we would use the significance level α_{0} = α/k for each of the k separate tests of significance. Several less conservative adjustments for multiple tests of significance have been proposed, such as the procedure of Holm [

As an alternative to our test, one could test the k-parameter joint null hypothesis using the multiple tests discussed above. However, all of these tests are conservative compared with the multivariate test presented here. On the other hand, multiple comparison tests can be applied when the vector of k statistics is not normally distributed, which makes them suitable for the additive model given in Appendix A2.
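The Bonferroni and Holm adjustments mentioned above can be illustrated with a short sketch. The p-values are hypothetical and serve only to show how the two procedures differ; the function names are chosen here for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / k."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down procedure: compare the ordered p-values with
    alpha/k, alpha/(k-1), ..., and stop at the first non-rejection."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break
    return reject

# Hypothetical p-values for five follow-up tests of individual variants
p_values = [0.001, 0.012, 0.04, 0.2, 0.011]
print(bonferroni(p_values))  # [True, False, False, False, False]
print(holm(p_values))        # [True, True, False, False, True]
```

Both procedures control the overall significance level at α; Holm rejects every hypothesis that Bonferroni rejects and possibly more, which is the sense in which it is less conservative.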

Garcia-Closas [

The results obtained here can be easily extended to a group of k genetic variants and l environmental factors, when the exposure to the i^{th} environmental factor can be specified as E_{i} = 1 (present) or E_{i} = 0 (absent) and the E_{i}s are independent among themselves and are independent of the genetic variants.

Our approach is limited by its inability to examine higher-order interactions and by the assumption of independence among all loci. Covariance terms in the variance-covariance matrix could increase the sample size needed to detect the group of genetic variants. It is possible that individual effects are not detectable while joint effects due to interactions exist; our method cannot detect such interactions. Our sample size calculation also relies on the normal approximation to the binomial distribution. Another limitation is the assumption of multiplicative effects of genetic variants. True biologic interactions could be more complex, with epistasis and/or other genetic phenomena; furthermore, joint genetic effects and gene-environment interactions on risk may be neither additive nor multiplicative. Unfortunately, for statistical modeling, epidemiologic analyses have had to rely on multiplicative or additive models. The rare disease assumption in case-control studies has been discussed in many papers [

A non-parametric approach to this problem is the method of Multidimensionality Reduction (MDR), introduced by Ritchie [

Another recent approach that holds great promise is logic regression, introduced by Ruczinski [

Suppose there are k genetic variants in a group of genetic variants and only r of them are associated with the disease. The prevalence of each of the (k-r) genetic variants that are not associated with the disease (relative risk equal to 1) is identical for cases and controls. Therefore, these variants contribute nothing to the denominator of equation (1), which is the same for the k genetic variants as for the r associated variants; the two sample sizes differ only through the non-centrality parameter δ, which depends on the degrees of freedom. Since our sample size is a function of the squares of the differences between the prevalences of genetic variants in cases and controls, our method is valid even when we have a combination of positively and negatively associated genetic variants.

One advantage of our method is the simultaneous test of difference of mean exposure instead of multiple testing. Thus, for a range of reasonable numbers of genetic variants, the sample size requirement declines with the increasing number of genetic variants. It is possible that the sample size required to detect a group of genetic variants could increase when adding a genetic variant to the group. However, the sample size required to detect the group with this genetic variant is still less than the sample size required to detect the genetic variant alone or to detect a subset of the genetic variants containing this genetic variant. When testing for an effect size in a group of genetic variants, one can use the global test described in this paper as a screening tool, because the sample size required to detect an effect size in the group is comparatively small. Note that we are comparing the ability to detect at least one of many genetic variants (global test) with the power to detect just one, which are different null hypotheses. If the global test is non-significant, testing for individual genetic variants that require a large sample size is not necessary.

More methodological work is needed in this area to detect joint effects of multiple genetic variants. Our method could be viewed as a screening tool for assessing groups of genetic variants involved in pathogenesis and etiology of common complex human diseases.

The authors declare that they have no competing interests.

RM led the statistical analysis and drafted parts of the manuscript. QY contributed to the statistical design. MK designed and led the overall study and drafted parts of the manuscript. All authors read and approved the manuscript.

Let f_{0}(X_{1}, X_{2},..., X_{k}) be the joint probability density function among controls and f_{1}(X_{1}, X_{2},..., X_{k}) be the joint probability density function among cases. If

f_{0}(X_{1}, X_{2},..., X_{k}) = Pr[(X_{1}, X_{2},..., X_{k}) | \bar{D}] and f_{1}(X_{1}, X_{2},..., X_{k}) = Pr[(X_{1}, X_{2},..., X_{k}) | D]

The probability density function of the exposure variables in the population at risk becomes:

f(X_{1}, X_{2},..., X_{k}) = f_{0}(X_{1}, X_{2},..., X_{k}) Pr(\bar{D}) + f_{1}(X_{1}, X_{2},..., X_{k}) Pr(D)

Assuming the probability of disease is small, we can approximate the distribution of the exposures among the controls by that present among the general population.

f(X_{1}, X_{2},..., X_{k}) ≈ f_{0}(X_{1}, X_{2},..., X_{k}).

Assuming that the exposure variables corresponding to the k-genetic variants are independent, the joint distribution of the k exposure variables is given by

Consider the multiplicative risk model:

where I is the background risk. The average rate of disease in the population at risk is given by

The summation is over all the possible values each X_{i }can assume (0 and 1).

Using (A) and (B), it can be shown that

Yang [

If U_{1}, U_{2},..., U_{k}, denote the exposure variables (U_{i }can assume only 0 or 1) among cases for the k-genetic variants, their joint probability density function is given by the product of the risk function and the probability density function of the exposure variables in the controls divided by M. Lui [_{1}, U_{2},..., U_{k }is given by

where

A comparison of (A) with (C) shows that the joint distribution of exposure among cases has the same form as that of controls; however, the two distributions have different parameters for the prevalence of the genetic variants, and the assumption of independence of exposure variables for controls results in the independence of exposure variables for cases. The prevalence of the i^{th} genetic variant among cases is given by

The mean exposure level of the i^{th} genetic variant for controls is G_{i}, for i = 1, 2,..., k. Similarly, the mean exposure levels of the k genetic variants for cases are given by
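The displayed equations of this derivation did not survive in this copy. The LaTeX below is a reconstruction under stated assumptions, not the authors' typeset equations: the labels (A), (B) and (C) are matched to the references in the text, Pr(D | X) is used here for the risk function, and each line follows from the independence assumption and the multiplicative model as described above.

```latex
% Assumed reconstruction of the Appendix A1 derivation.
\begin{align*}
\text{(A)}\quad f(X_1,\dots,X_k) &= \prod_{i=1}^{k} G_i^{X_i} (1 - G_i)^{1 - X_i}, \\
\text{(B)}\quad \Pr(D \mid X_1,\dots,X_k) &= I \prod_{i=1}^{k} R_i^{X_i}, \\
M = \sum_{X} \Pr(D \mid X)\, f(X) &= I \prod_{i=1}^{k} \{ 1 + (R_i - 1) G_i \}, \\
\text{(C)}\quad f_1(U_1,\dots,U_k) = \frac{\Pr(D \mid U)\, f(U)}{M}
  &= \prod_{i=1}^{k} (G_i^{*})^{U_i} (1 - G_i^{*})^{1 - U_i}, \\
G_i^{*} &= \frac{R_i G_i}{1 + (R_i - 1) G_i}.
\end{align*}
```

The background risk I cancels in (C), so the distribution among cases depends only on the prevalences and relative risks. The mean exposure of the i-th variant among cases is G_i*, which equals G_i exactly when R_i = 1.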

H_{0}: R_{1} = R_{2} = ... = R_{k} = 1 versus H_{1}: at least one R_{i} ≠ 1.

This test is identical to the test:

H_{0}: G_{i} = G_{i}* for all i versus H_{1}: G_{i} ≠ G_{i}* for at least one i
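The equivalence of the two hypotheses rests on the form of the exposure distribution among cases. The following sketch checks it numerically by brute-force enumeration, under the assumptions of independent binary exposures in the population and a multiplicative risk model. The prevalences and relative risks are hypothetical, and the closed form G_i* = R_i G_i / (1 + (R_i - 1) G_i) used for comparison is a reconstruction, since the displayed equation is missing from this copy.

```python
from itertools import product
from math import isclose

def case_distribution(prevalences, risks):
    """Joint exposure distribution among cases under the multiplicative model.

    Enumerates all 2^k exposure patterns. The background risk I cancels in
    the normalisation, so it is omitted.
    """
    weights = {}
    for pattern in product((0, 1), repeat=len(prevalences)):
        w = 1.0
        for x, g, r in zip(pattern, prevalences, risks):
            w *= r * g if x else 1.0 - g
        weights[pattern] = w
    total = sum(weights.values())  # equals M / I
    return {pattern: w / total for pattern, w in weights.items()}

def case_prevalence(g, r):
    """Reconstructed closed form for the prevalence among cases."""
    return r * g / (1.0 + (r - 1.0) * g)

# Hypothetical prevalences and relative risks; the third variant has no effect
G = [0.1, 0.3, 0.6]
R = [2.0, 1.5, 1.0]

dist = case_distribution(G, R)
marginals = [sum(p for pat, p in dist.items() if pat[i]) for i in range(len(G))]

def product_of_marginals(pattern):
    """Probability of a pattern if exposures among cases were independent."""
    value = 1.0
    for x, m in zip(pattern, marginals):
        value *= m if x else 1.0 - m
    return value

matches_closed_form = all(
    isclose(m, case_prevalence(g, r)) for m, g, r in zip(marginals, G, R)
)
independent = all(isclose(p, product_of_marginals(pat)) for pat, p in dist.items())
print(marginals, matches_closed_form, independent)
```

In this example the variant with R = 1 keeps its population prevalence among cases, while the other two shift upward, which is why testing R_i = 1 for all i amounts to testing equality of prevalences between cases and controls.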

We assume equal sample sizes for cases and controls. For a large sample size n (the sample size for controls or cases), the variance-covariance matrices of

Under the null hypothesis, the variance covariance matrices for

For large sample sizes, the simultaneous test for difference in prevalence between cases and controls is:

where

Consider the additive risk model:

Pr(D | X_{1}, X_{2},..., X_{k}) = a_{0} + a_{1}X_{1} + a_{2}X_{2} + ... + a_{k}X_{k}

where a_{0} = I and a_{i} = (R_{i}-1)I.

The average rate of the disease in the population at risk is given by

where the probability density function, f, is given by (1).

It can be shown that A = a_{0} + a_{1}G_{1} + a_{2}G_{2} + ... + a_{k}G_{k}.

Using the notations described for multiplicative models, the probability density function of the exposure levels of k genetic variants among cases is given by:

This is not an identifiable probability density function. Although it can be shown that the marginal distributions have asymptotically normal distributions, this does not guarantee the asymptotic normality of the joint distribution.
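One way to see how the additive case differs from the multiplicative one is to enumerate the exposure distribution among cases for a small example. The sketch below assumes independent binary exposures in the population and the additive model with a_0 = I and a_i = (R_i - 1)I; the prevalences and relative risks are hypothetical. It illustrates that exposures among cases are no longer independent, which is offered here as an observation and not as the authors' argument.

```python
from itertools import product
from math import isclose

def additive_case_distribution(prevalences, risks):
    """Joint exposure distribution among cases under the additive model.

    Risk relative to background is 1 + sum((R_i - 1) * x_i); the background
    risk I cancels. Returns the distribution and the normalising constant A / I.
    """
    weights = {}
    for pattern in product((0, 1), repeat=len(prevalences)):
        f = 1.0
        for x, g in zip(pattern, prevalences):
            f *= g if x else 1.0 - g
        rel_risk = 1.0 + sum((r - 1.0) * x for x, r in zip(pattern, risks))
        weights[pattern] = rel_risk * f
    total = sum(weights.values())
    return {pattern: w / total for pattern, w in weights.items()}, total

# Hypothetical example with two variants
G = [0.5, 0.5]
R = [2.0, 2.0]

dist, a_over_i = additive_case_distribution(G, R)
p1 = sum(p for pat, p in dist.items() if pat[0])
p2 = sum(p for pat, p in dist.items() if pat[1])
covariance = dist[(1, 1)] - p1 * p2

# A / I should equal 1 + sum((R_i - 1) * G_i), in line with
# A = a_0 + a_1 G_1 + ... + a_k G_k
print(a_over_i, p1, p2, covariance)
```

For these values the covariance between the two exposures among cases is negative, so the joint distribution does not factor into a product of independent Bernoulli terms as it does under the multiplicative model.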

The findings and conclusions in this report are those of the author(s) and do not necessarily represent the views of the Centers for Disease Control and Prevention/the Agency for Toxic Substances and Disease Registry.