The nested case–control study design, in which a fixed number of controls are matched to each case, is often used to analyze exposure–response associations within a cohort. It has become common practice to sample four or five controls per case; however, previous research has shown that in certain instances, significant gains in relative efficiency can be realized when more controls are matched to each case. This study expanded upon this and investigated the effect of (i) the number of cases, (ii) the strength of the exposure–response, and (iii) the skewness of the exposure distribution on the bias and relative efficiency of the conditional likelihood estimator from a nested case–control study.

Cohorts were simulated and analyzed using conditional logistic regression.

The relative efficiency decreased and bias away from the null increased, as the true exposure–response parameter increased and the skewness of the exposure distribution of the risk-sets increased. This became more pronounced when the number of cases in the cohort was small.

Gains in relative efficiency and a reduction in bias can be realized by sampling more controls per case than the four or five generally used, especially when there are few cases, a strong exposure–response relation, and a skewed exposure variable.

Cohort studies are frequently conducted to evaluate the effect of exposure to a particular physical or chemical agent on the occurrence of or death from a particular disease. The Cox proportional hazard model (

It has been shown that unbiased exposure–response estimates could be obtained by analyzing a sample of the cohort using the conditional likelihood (

However, these results are asymptotic properties; that is, they apply as the size of the cohort (and, therefore, the number of cases in the cohort) approaches infinity. It is not clear how these results hold in situations with small sample sizes, or when there are few observed cases in the cohort due to a rare outcome.

In addition, it seems to have become common practice to simply sample four or five controls per case, even with a rare outcome such as death from leukemia. For example, a recent PubMed search for “nested case–control” and “leukemia” articles published in 2012 returned nine studies. Two of these studies analyzed the full cohort and were not considered. Of the remaining seven studies, six matched five or fewer controls per case, including three studies that only observed 22, 64, and 118 cases. The remaining study observed 71 cases and sampled 10 controls per case. In these scenarios, the asymptotic theory does not guarantee the properties of the conditional logistic regression estimator, which may be biased and/or inefficient.

While previous work has stated that sampling four or five controls per case in a matched case–control study is sufficient and there is little to be gained in sampling more controls per case (

This article hopes to expand upon these findings through a simulation study by also considering a continuous exposure variable as well as considering potential bias due to small samples. In particular, this article will investigate the effect of (i) the number of cases, (ii) the strength of the exposure–response, and (iii) the skewness of the exposure distribution on the bias and relative efficiency of the conditional likelihood estimator from a nested case–control study.

Simulations were conducted using SAS Software (version 9.1.3, SAS Institute Inc., Cary, NC). Cohorts were simulated based on methods developed by … [distribution 1: normal (μ = 25, σ^{2} = 64), truncated between 0 and 50; distribution 2: log-normal (μ = 2.5, σ^{2} = 0.25), truncated between 0 and 50; and distribution 3: log-normal (μ = 0.75, σ^{2} = 1), truncated between 0 and 50]. These distributions were chosen to study the effect of skewness on bias and relative efficiency. Distribution 1 is symmetric (skewness of 0), distribution 2 is slightly right-skewed (skewness of about 1.35), and distribution 3 is very right-skewed (skewness of about 3.7). Graphs of the probability density functions for the three distributions are presented in
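The three truncated exposure-intensity distributions can be sketched in Python as follows. This is an illustration, not the SAS code used in the study. It assumes that truncation was implemented by redrawing out-of-range values (rejection sampling) and that μ and σ^{2} of the log-normal distributions are given on the log scale; the names `draw_truncated` and `sample_intensities` are hypothetical.

```python
import random

LOWER, UPPER = 0.0, 50.0  # truncation bounds stated in the text

# Standard deviations are the square roots of the stated variances.
# Assumption: log-normal parameters are on the log scale.
DISTRIBUTIONS = {
    1: lambda rng: rng.gauss(25.0, 8.0),           # normal, variance 64
    2: lambda rng: rng.lognormvariate(2.5, 0.5),   # log-normal, variance 0.25
    3: lambda rng: rng.lognormvariate(0.75, 1.0),  # log-normal, variance 1
}


def draw_truncated(sampler, rng, lower=LOWER, upper=UPPER):
    """Redraw until the value falls strictly inside (lower, upper)."""
    while True:
        value = sampler(rng)
        if lower < value < upper:
            return value


def sample_intensities(distribution_id, n, seed=0):
    """Draw n exposure intensities from one of the three distributions."""
    rng = random.Random(seed)
    sampler = DISTRIBUTIONS[distribution_id]
    return [draw_truncated(sampler, rng) for _ in range(n)]
```

Rejection sampling keeps the shape of each distribution inside the bounds; clipping values to 0 or 50 instead would pile probability mass at the limits.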

Each simulated cohort consisted of 5,000 workers. For each scenario with ~30 cases, 10,000 cohorts were simulated; for each scenario with ~100 cases, 3,000 cohorts; and for each scenario with ~300 cases, 1,000 cohorts. The number of cohorts varied because the variance of the estimates is approximately inversely proportional to the number of cases: the simulations with ~30 cases require about 10 times as many cohorts as those with ~300 cases to achieve the same level of precision. Hence, 10,000 and 1,000 cohorts were simulated, respectively.
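As a quick check of this allocation, the product of the approximate number of cases per cohort and the number of cohorts is the same in all three settings. The short Python sketch below only does that arithmetic. Treating the total number of simulated cases as a proxy for the precision of the summarized results is an assumption, and the function name is hypothetical.

```python
# Approximate cases per cohort -> number of simulated cohorts (from the text).
COHORTS_BY_CASES = {30: 10_000, 100: 3_000, 300: 1_000}


def total_simulated_cases(cohorts_by_cases):
    """Approximate total number of cases across all cohorts, per setting."""
    return {cases: cases * cohorts for cases, cohorts in cohorts_by_cases.items()}


totals = total_simulated_cases(COHORTS_BY_CASES)
for cases, total in totals.items():
    print(f"~{cases} cases per cohort: about {total:,} simulated cases in total")
```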

Each worker was randomly assigned values for age at first exposure (18 years plus a random exponential variable with mean 10) and maximum follow-up time (40 years minus a random exponential variable with mean 5). Each worker was also assigned a maximum exposure duration of 15 years. Therefore, since the exposure intensity was truncated to be below 50, the maximum exposure an individual could accumulate is 750 units (50 units/year × 15 years).
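The worker-level assignments described above can be sketched as follows. This is a hedged illustration in Python, not the study's SAS code: the function and field names are hypothetical, and the text does not say how the rare draws that would give a non-positive follow-up time were handled, so the sketch leaves them as drawn.

```python
import random

MAX_INTENSITY = 50.0  # units/year, upper truncation bound of exposure intensity
MAX_DURATION = 15.0   # years, maximum exposure duration
MAX_CUMULATIVE = MAX_INTENSITY * MAX_DURATION  # 750 units


def simulate_worker(rng):
    """Assign age at first exposure and maximum follow-up time to one worker."""
    # random.expovariate takes the rate, i.e. 1 / mean.
    age_first_exposure = 18.0 + rng.expovariate(1.0 / 10.0)
    max_follow_up = 40.0 - rng.expovariate(1.0 / 5.0)  # may rarely be <= 0
    return {
        "age_first_exposure": age_first_exposure,
        "max_follow_up": max_follow_up,
    }


def cumulative_exposure(intensity, years_exposed):
    """Intensity times exposure duration, with duration capped at 15 years."""
    duration = min(max(years_exposed, 0.0), MAX_DURATION)
    return intensity * duration
```

For example, `cumulative_exposure(50.0, 20.0)` hits the 750-unit ceiling because the duration is capped at 15 years.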

At each year of a worker’s maximum follow-up time, the worker’s current age and cumulative exposure (equal to the worker’s exposure intensity multiplied by exposure duration) were calculated. Also, at each year, a conditional probability of mortality from the outcome of interest (conditional on survival to that age), ^{β}). The parameter α is an intercept parameter which varied in each simulation scenario and was chosen to obtain the desired number of cases (on average). It is not possible to completely control the number of cases in each cohort through this method; rather the number of cases in each simulated cohort will vary.

Additionally, at each follow-up year, a conditional probability of mortality from any other outcome (conditional on survival to that age),

Two Bernoulli random variables were assigned to each worker at each year, one with probability

At first glance, the hazard ratios chosen may seem very small. However, it is important to note that these hazard ratios are per unit of exposure for an exposure where it is possible to accumulate 750 units. To relate these hazard ratios to a specific study, the results must be appropriately scaled. For example, in a study of gold miners exposed to silica,

Risk-sets were created for each cohort, with age as the time scale. For each case, 1, 5, 10, 15, and 20 controls were randomly sampled from the risk-sets. The full as well as the sampled risk-sets were analyzed using conditional logistic regression (procedure PHREG in SAS) to obtain estimates of the exposure–response parameter. The
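A minimal Python sketch of risk-set sampling with age as the time scale is given below. It assumes that a worker is at risk from an entry age up to and including an exit age, and that workers who become cases later remain eligible as controls for earlier cases. The `Worker` class and the function names are hypothetical and do not reproduce the study's SAS implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    worker_id: int
    entry_age: float  # age at which follow-up starts
    exit_age: float   # age at death or end of follow-up
    is_case: bool     # True if the worker died from the outcome of interest


def risk_set(case, cohort):
    """Workers other than the case who are under follow-up at the case's exit age."""
    return [
        w for w in cohort
        if w.worker_id != case.worker_id
        and w.entry_age < case.exit_age <= w.exit_age
    ]


def sample_controls(cohort, controls_per_case, rng):
    """Randomly sample controls, without replacement, from each case's risk set."""
    matched = {}
    for case in cohort:
        if not case.is_case:
            continue
        eligible = risk_set(case, cohort)
        k = min(controls_per_case, len(eligible))  # risk set may be small
        matched[case.worker_id] = rng.sample(eligible, k)
    return matched
```

Calling `sample_controls(cohort, m, rng)` with `m` set to 1, 5, 10, 15, or 20 gives the matched sets for the corresponding control-to-case ratio.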

The PHREG procedure will not converge if, in every risk-set, the case’s exposure is higher (lower) than the maximum (minimum) exposure of the corresponding controls in the risk-set, because the maximum likelihood estimate is infinity (–infinity). In this situation, PHREG will report the last estimate when the optimization algorithm stopped, which most likely will be a very large estimate with a large standard error. When summarizing the simulated results, observations for which the resulting standard error was greater than 1 were excluded, because this was taken as an indication that the procedure had trouble converging. As a result of removing these extreme results, all analyses will be conditional on the algorithm converging, and any summary statistics may be underestimated.
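The non-convergence condition and the exclusion rule described above can be expressed as two small checks. The Python sketch below assumes that each risk set is represented as a pair of the case's exposure and a non-empty list of control exposures; the function names are hypothetical.

```python
def case_always_extreme(risk_sets):
    """True if the case's exposure is above the control maximum in every
    risk set, or below the control minimum in every risk set. In either
    situation the maximum likelihood estimate is infinite."""
    all_higher = all(case > max(controls) for case, controls in risk_sets)
    all_lower = all(case < min(controls) for case, controls in risk_sets)
    return all_higher or all_lower


def keep_estimate(standard_error, threshold=1.0):
    """Exclusion rule from the text: drop results with a standard error above 1."""
    return standard_error <= threshold
```

Checking `case_always_extreme` before fitting flags the problem directly, whereas the standard-error rule detects it only after the optimizer has stopped.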

Results from simulations based on distributions 1 and 2; true hazard ratios of 1, 1.005, and 1.015; and ~30 cases and ~100 cases are presented in all tables and figures; complete results can be found in the

The parameter estimates from each scenario using distribution 1 are summarized in

The bias in each scenario was also calculated (

The results from scenarios with ~300 cases and distribution 3 continue the trends summarized above. Namely, bias decreased with more cases but increased as the skewness of the exposure distribution increased. Also, relative efficiency increased with more cases and decreased as the skewness increased. Specific results can be found in the

Previous work has stated that sampling four or five controls per case in a matched case–control study is sufficient, and there is little to be gained in sampling more controls per case (

When the goal is to obtain a precise risk estimate rather than simply detecting a significantly positive estimate, such as in a risk assessment study, more controls should be matched to each case. For example,

In addition to lower precision, such conditions also resulted in bias away from the null in the simulations of this study. For example, in the simulations with ~30 cases, a skewed distribution, and a comparable true hazard ratio of 1.015, the bias was over 15% with five controls matched to each case and 8% with ten controls. Presumably, with only nine cases, the bias in the rubber workers cohort study would be more extreme and could be reduced by sampling additional controls per case. Greater precision and reduced bias would have been desirable to adequately evaluate the effectiveness of the OSHA occupational-exposure limit.

It has been shown previously that relative efficiency decreases as the strength of the exposure–response increases. In fact,

In addition, alternative methods have been proposed to improve the relative efficiency. In particular,

Bias away from the null has also been noted before in the literature for matched case–control studies. A study by

Lastly, in addition to decreased relative efficiency and greater bias, having few cases, a skewed exposure distribution, and a strong exposure–response resulted in an increased number of analyses that did not converge. However, this was only a major issue when one control was matched to each case. When at least five controls were matched to each case, only 1.0% of the analyses failed to converge in the worst scenario, and this decreased to 0 when there were at least ~100 cases in the study. Therefore, sampling more controls per case, especially when there are few cases, will help ensure that the resulting analysis will converge and provide a meaningful exposure–response estimate.

A limitation of this study is that it only considered scenarios with one covariate. It is not completely clear how these results would generalize to scenarios with more than one covariate in the model, and this could be the topic of a future study.

In summary, we found that the relative efficiency decreases as the strength of the exposure–response parameter increases and as the skewness of the exposure distribution increases. Also, considerable bias away from the null was observed when the number of cases in the study was small; however, selecting more controls per case reduced this bias. Consequently, the results of this article (including the complete results listed in the

Graph of the probability density function for the distributions used in the simulations. Distribution 1: normal (μ = 25, σ^{2} = 64) – truncated between 0 and 50, distribution 2: log-normal (μ = 2.5, σ^{2} = 0.25) – truncated between 0 and 50, and distribution 3: log-normal (μ = 0.75, σ^{2} = 1) – truncated between 0 and 50

Relative efficiency vs control-to-case ratio by true hazard ratio. The solid curve T represents the graph of the equation

Percent bias vs control-to-case ratio by true hazard ratio

Summary statistics of the exposure–response parameter estimates for each scenario with exposure intensity distribution 1

| True HR | Match | N (~30 cases) | Mean (~30 cases) | Empirical SE (~30 cases) | Estimated SE (~30 cases) | Relative efficiency (~30 cases) | N (~100 cases) | Mean (~100 cases) | Empirical SE (~100 cases) | Estimated SE (~100 cases) | Relative efficiency (~100 cases) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1:1 | 10,000 | 1.0000 | 2.16E–03 | 2.02E–03 | 38.7 | 3,000 | 1.0000 | 1.13E–03 | 1.11E–03 | 48.1 |
| 1 | 1:5 | 10,000 | 1.0000 | 1.52E–03 | 1.48E–03 | 78.3 | 3,000 | 1.0000 | 8.52E–04 | 8.45E–04 | 84.5 |
| 1 | 1:10 | 10,000 | 1.0000 | 1.43E–03 | 1.41E–03 | 88.7 | 3,000 | 1.0000 | 8.20E–04 | 8.07E–04 | 91.4 |
| 1 | 1:15 | 10,000 | 1.0000 | 1.40E–03 | 1.39E–03 | 92.6 | 3,000 | 1.0000 | 8.10E–04 | 7.95E–04 | 93.5 |
| 1 | 1:20 | 10,000 | 1.0000 | 1.38E–03 | 1.38E–03 | 94.0 | 3,000 | 1.0000 | 8.08E–04 | 7.88E–04 | 94.0 |
| 1 | Full | 10,000 | 1.0000 | 1.34E–03 | 1.34E–03 | | 3,000 | 1.0000 | 7.84E–04 | 7.69E–04 | |
| 1.005 | 1:1 | 10,000 | 1.0056 | 5.80E–03 | 2.63E–03 | 5.9 | 3,000 | 1.0052 | 1.41E–03 | 1.37E–03 | 33.4 |
| 1.005 | 1:5 | 10,000 | 1.0051 | 1.71E–03 | 1.66E–03 | 68.4 | 3,000 | 1.0050 | 9.59E–04 | 9.43E–04 | 72.1 |
| 1.005 | 1:10 | 10,000 | 1.0051 | 1.57E–03 | 1.54E–03 | 81.4 | 3,000 | 1.0050 | 8.84E–04 | 8.78E–04 | 84.9 |
| 1.005 | 1:15 | 10,000 | 1.0051 | 1.52E–03 | 1.49E–03 | 86.7 | 3,000 | 1.0050 | 8.52E–04 | 8.54E–04 | 91.3 |
| 1.005 | 1:20 | 10,000 | 1.0050 | 1.49E–03 | 1.47E–03 | 90.2 | 3,000 | 1.0050 | 8.58E–04 | 8.42E–04 | 89.9 |
| 1.005 | Full | 10,000 | 1.0050 | 1.41E–03 | 1.40E–03 | | 3,000 | 1.0050 | 8.14E–04 | 8.05E–04 | |
| 1.015 | 1:1 | 9,782 | 1.0189 | 1.50E–02 | 8.47E–03 | 1.2 | 3,000 | 1.0160 | 4.15E–03 | 3.34E–03 | 5.2 |
| 1.015 | 1:5 | 9,999 | 1.0158 | 3.56E–03 | 3.11E–03 | 21.6 | 3,000 | 1.0152 | 1.81E–03 | 1.72E–03 | 27.3 |
| 1.015 | 1:10 | 10,000 | 1.0154 | 2.68E–03 | 2.49E–03 | 38.0 | 3,000 | 1.0152 | 1.46E–03 | 1.43E–03 | 42.0 |
| 1.015 | 1:15 | 10,000 | 1.0153 | 2.40E–03 | 2.26E–03 | 47.6 | 3,000 | 1.0151 | 1.31E–03 | 1.31E–03 | 51.8 |
| 1.015 | 1:20 | 10,000 | 1.0153 | 2.25E–03 | 2.13E–03 | 54.1 | 3,000 | 1.0151 | 1.25E–03 | 1.24E–03 | 57.8 |
| 1.015 | Full | 10,000 | 1.0151 | 1.65E–03 | 1.62E–03 | | 3,000 | 1.0150 | 9.46E–04 | 9.70E–04 | |

Notes:

Mean is the exponential of the mean of the estimated log hazard ratios.

Empirical standard error is the sample standard deviation of the estimated log hazard ratios.

Estimated standard error is the mean of the estimated standard errors.

Relative efficiency of 1:
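The summary statistics defined in these notes can be computed from the per-cohort results as sketched below. The first three quantities follow the definitions in the notes. The relative efficiency formula is an assumption: it is taken as the ratio of the empirical variance from the full risk-sets to that from the sampled risk-sets, expressed as a percentage, which comes close to the tabulated values when applied to the rounded standard errors (for example, about 38.5 against the reported 38.7 for a true hazard ratio of 1, 1:1 matching, and ~30 cases). Function names are hypothetical.

```python
import math
import statistics


def summarize(log_hazard_ratios, standard_errors):
    """Summaries over the simulated cohorts of one scenario."""
    return {
        # Exponential of the mean of the estimated log hazard ratios.
        "mean_hr": math.exp(statistics.mean(log_hazard_ratios)),
        # Sample standard deviation of the estimated log hazard ratios.
        "empirical_se": statistics.stdev(log_hazard_ratios),
        # Mean of the estimated standard errors.
        "estimated_se": statistics.mean(standard_errors),
    }


def relative_efficiency(empirical_se_full, empirical_se_sampled):
    """Assumed definition: variance ratio (full / sampled), in percent."""
    return 100.0 * (empirical_se_full / empirical_se_sampled) ** 2
```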

Summary statistics of the exposure–response parameter estimates for each scenario with exposure intensity distribution 2

| True HR | Match | N (~30 cases) | Mean (~30 cases) | Empirical SE (~30 cases) | Estimated SE (~30 cases) | Relative efficiency (~30 cases) | N (~100 cases) | Mean (~100 cases) | Empirical SE (~100 cases) | Estimated SE (~100 cases) | Relative efficiency (~100 cases) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1:1 | 10,000 | 1.0000 | 2.84E–03 | 2.62E–03 | 36.9 | 3,000 | 1.0000 | 1.42E–03 | 1.40E–03 | 44.7 |
| 1 | 1:5 | 10,000 | 0.9999 | 1.95E–03 | 1.89E–03 | 78.5 | 3,000 | 1.0000 | 1.06E–03 | 1.00E–03 | 80.1 |
| 1 | 1:10 | 10,000 | 0.9998 | 1.84E–03 | 1.80E–03 | 88.3 | 3,000 | 1.0000 | 1.01E–03 | 1.00E–03 | 89.4 |
| 1 | 1:15 | 10,000 | 0.9998 | 1.80E–03 | 1.76E–03 | 92.1 | 3,000 | 1.0000 | 9.96E–04 | 9.96E–04 | 91.2 |
| 1 | 1:20 | 10,000 | 0.9998 | 1.78E–03 | 1.75E–03 | 93.9 | 3,000 | 1.0000 | 9.80E–04 | 9.87E–04 | 94.2 |
| 1 | Full | 10,000 | 0.9998 | 1.73E–03 | 1.70E–03 | | 3,000 | 1.0000 | 9.51E–04 | 9.63E–04 | |
| 1.005 | 1:1 | 9,999 | 1.0058 | 4.00E–03 | 3.09E–03 | 11.3 | 3,000 | 1.0052 | 1.50E–03 | 1.40E–03 | 20.8 |
| 1.005 | 1:5 | 10,000 | 1.0051 | 1.87E–03 | 1.79E–03 | 52.1 | 3,000 | 1.0050 | 9.12E–04 | 9.04E–04 | 56.4 |
| 1.005 | 1:10 | 10,000 | 1.0051 | 1.64E–03 | 1.59E–03 | 67.4 | 3,000 | 1.0050 | 8.14E–04 | 8.06E–04 | 70.8 |
| 1.005 | 1:15 | 10,000 | 1.0050 | 1.57E–03 | 1.50E–03 | 74.0 | 3,000 | 1.0050 | 7.74E–04 | 7.68E–04 | 78.4 |
| 1.005 | 1:20 | 10,000 | 1.0050 | 1.51E–03 | 1.46E–03 | 80.1 | 3,000 | 1.0050 | 7.51E–04 | 7.47E–04 | 83.1 |
| 1.005 | Full | 10,000 | 1.0049 | 1.35E–03 | 1.31E–03 | | 3,000 | 1.0049 | 6.85E–04 | 6.79E–04 | |
| 1.015 | 1:1 | 7,911 | 1.0206 | 2.20E–02 | 1.51E–02 | 0.3 | 2,992 | 1.017 | 6.62E–03 | 4.50E–03 | 1.0 |
| 1.015 | 1:5 | 9,897 | 1.0175 | 1.09E–02 | 5.51E–03 | 1.3 | 3,000 | 1.0154 | 2.07E–03 | 1.90E–03 | 10.4 |
| 1.015 | 1:10 | 9,987 | 1.0164 | 7.74E–03 | 3.51E–03 | 2.5 | 3,000 | 1.0153 | 1.55E–03 | 1.40E–03 | 18.5 |
| 1.015 | 1:15 | 9,996 | 1.0159 | 3.82E–03 | 2.79E–03 | 10.3 | 3,000 | 1.0152 | 1.37E–03 | 1.30E–03 | 23.7 |
| 1.015 | 1:20 | 9,996 | 1.0157 | 3.24E–03 | 2.48E–03 | 14.3 | 3,000 | 1.0152 | 1.23E–03 | 1.20E–03 | 29.4 |
| 1.015 | Full | 10,000 | 1.0152 | 1.22E–03 | 1.20E–03 | | 3,000 | 1.0150 | 6.68E–04 | 6.92E–04 | |

Notes:

Mean is the exponential of the mean of the estimated log hazard ratios.

Empirical standard error is the sample standard deviation of the estimated log hazard ratios.

Estimated standard error is the mean of the estimated standard errors.

Relative efficiency of 1: