AUTHOR CONTRIBUTIONS

HL and JPK conceptualized the study. HL performed statistical analysis and wrote the manuscript. JPK and KFL acquired data, reviewed and contributed to discussion. All authors approved the final version of the manuscript.

A nested case-control (NCC) design within a prospective cohort study can realize substantial benefits for biomarker studies. In this context, it is natural to consider sample availability when selecting controls, in order to minimize data loss when implementing the design. However, this violates the randomness required for the selection and leads to biased analyses. Inverse probability weighting may improve the analysis, but the current approach using weighted Cox regression fails to maintain the benefits of the NCC design.

This paper introduces weighted conditional logistic regression. We illustrate the proposed analysis using data recently investigated in The Environmental Determinants of Diabetes in the Young (TEDDY) study. Considering the potential data loss, the TEDDY NCC design was moderately selective in its selection of controls. A data-driven simulation study was performed to assess the bias that arises when non-random control selection is ignored in the analysis, and how well weighting corrects it.

In the TEDDY data analysis, the standard analysis using conditional logistic regression estimated the regression parameter as −0.015 (95% confidence interval: −0.023, −0.007). The biased estimate using Cox regression was −0.011 (−0.019, −0.003). Weighted Cox regression estimated −0.013 (−0.026, 0.0004). The proposed weighted conditional logistic regression estimated −0.020 (−0.033, −0.007), showing a stronger negative effect size than the one using conditional logistic regression. The simulation study also showed that the standard estimate of

Weighted conditional logistic regression can enhance the analysis by offering flexibility in the selection of controls, while maintaining the matching.

Prospective cohort studies are utilized to assess how incident events are influenced by the characteristics of interest in participants followed over time. However, the collection of prospective data can require substantial resources, especially when the incidence of events is low. When resources are limited, it may not be feasible to gather the data from the full cohort over the entire follow-up. A nested case-control (NCC) design is the primary choice in a prospective cohort study to avoid such situations without compromising many of the benefits from the full cohort analysis (

An NCC design includes all event cases up to a specific follow-up time, but selects only a pre-determined number of controls for each case from the event free subjects at the time when a case developed the event (

In this paper, we propose an alternative selection bias corrected analysis in an NCC design. By adopting the approach by (

In a prospective cohort study, we observe time of event or censoring for each participant in follow-up, whichever comes first. When time of event is observed, a “risk-set” is constructed, which includes all participants in follow-up at the event time.
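As a minimal sketch of the risk-set construction described above, the following Python example builds the risk-set at each event time and randomly draws a fixed number of controls from it. The toy cohort, the function names `risk_set` and `sample_ncc`, and the convention that a participant censored exactly at the event time is still in follow-up are illustrative assumptions, not details of the TEDDY design.

```python
import random

def risk_set(cohort, event_time):
    """Return ids of participants still in follow-up at event_time."""
    return [pid for pid, (time, _) in cohort.items() if time >= event_time]

def sample_ncc(cohort, m, seed=0):
    """For each case, randomly draw up to m controls from its risk-set."""
    rng = random.Random(seed)
    matched_sets = {}
    for case_id, (time, is_event) in cohort.items():
        if not is_event:
            continue
        # Future cases remain eligible as controls until their own event.
        candidates = [pid for pid in risk_set(cohort, time) if pid != case_id]
        matched_sets[case_id] = rng.sample(candidates, min(m, len(candidates)))
    return matched_sets

# Hypothetical cohort: id -> (observed time in months, event indicator)
cohort = {"A": (12, True), "B": (30, False), "C": (24, True),
          "D": (18, False), "E": (40, False)}
sets = sample_ncc(cohort, m=2)
```

Because the draw within each risk-set is purely random, every event free participant in follow-up has the same chance of selection, which is the randomness that a selective choice of controls would violate.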

An NCC design includes all cases in follow-up but selects controls for a case from those event free participants in the case’s risk-set in a prospective cohort study. In

Since controls are selected from the same risk-set as the case, an NCC design is considered a matched case-control design with the risk-set as a matching factor. Therefore, this design can also match on potential confounders at the subject level. In addition, whether intended or not, this risk-set matching leads to matching on the longitudinal data collected for a case and its matched controls (i.e., a sample-level matching). As shown in

In a matched case-control data analysis, conditional logistic regression is primarily used to examine the association between the event and characteristics measured by the time of event. The conditional odds can be written as below:
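$$L(\beta) = \prod_{i} \frac{\exp\{\beta X_{i0}(t_i)\}}{\sum_{j \in R_i} \exp\{\beta X_{ij}(t_i)\}} \qquad [1]$$

where $R_i$ includes the case ($j = 0$) and the controls in the $i$th matched case-control set, and $X_{ij}(t_i)$ is the $j$th subject's characteristics of interest by the event time $t_i$ in the $i$th set, since the design is matched by the case's risk-set. Although the likelihood [1] has the same form as the partial likelihood in Cox regression, $R_i$ is fixed by the design, as opposed to the one constructed by chance in the full cohort analysis. The regression parameter β corresponds to the log of the odds ratio for a unit change of $X_{ij}(t_i)$.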
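To make the conditional likelihood concrete, here is a small Python sketch that evaluates its logarithm for a single covariate. The data layout (one list of covariate values per matched set, with the case listed first) and the function name `conditional_loglik` are assumptions made for illustration only.

```python
import math

def conditional_loglik(beta, matched_sets):
    """Conditional log-likelihood over matched sets.

    Each set is a list of covariate values with the case listed first;
    the denominator sums over the case and its matched controls.
    """
    total = 0.0
    for xs in matched_sets:
        case_x = xs[0]
        denom = sum(math.exp(beta * x) for x in xs)
        total += beta * case_x - math.log(denom)
    return total

# Hypothetical 1-to-3 matched sets (case first, then three controls)
example_sets = [[48.0, 55.0, 60.0, 52.0], [50.0, 49.0, 58.0, 61.0]]
loglik_at_zero = conditional_loglik(0.0, example_sets)
```

At β = 0 every member of a set is equally likely to be the case, so each 1-to-3 set contributes −log(4) regardless of the covariate values.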

If the matching is broken in an NCC design, the participants included in the design may be considered as a sub-cohort selected from the full cohort. Then, weighted Cox regression can be a choice for selective cohort analysis, with the weight being the inverse selection probability for each participant. The weighted partial likelihood can be written as

$$L_w(\beta) = \prod_{i} \frac{\exp\{\beta X_{i}(t_i)\}}{\sum_{k \in \tilde{R}_i} w_k \exp\{\beta X_{k}(t_i)\}}$$

where $w_i$ is the inverse of the selection probability ($p_i$) for the $i$th subject in the sub-cohort (i.e., $w_i = 1/p_i$), and $\tilde{R}_i$ is the risk-set including the subjects in an NCC design who were being followed at the case's event time $t_i$. Here, $\tilde{R}_i$ is constructed at random among the subjects included in an NCC design, and the regression parameter β may correspond to the log of the hazard ratio for a unit change of $X_{i}(t)$.
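The sub-cohort view can be sketched in Python as follows. This is a simplified illustration under stated assumptions: a single time-fixed covariate (the TEDDY analysis used a time-varying one), cases carrying weight 1 because every case enters the design, and a hypothetical tuple layout and function name.

```python
import math

def weighted_partial_loglik(beta, subjects):
    """Weighted log partial likelihood for a pooled sub-cohort.

    subjects: list of (time, is_case, x, weight) tuples. The matched sets
    are ignored; risk-sets are rebuilt from everyone in the sub-cohort.
    """
    total = 0.0
    for t_i, is_case, x_i, _ in subjects:
        if not is_case:
            continue
        # Everyone still followed at t_i enters the denominator, weighted
        # by the inverse of their selection probability.
        denom = sum(w * math.exp(beta * x)
                    for t, _, x, w in subjects if t >= t_i)
        total += beta * x_i - math.log(denom)
    return total

# Hypothetical sub-cohort: two cases and one control with weight 2
subjects = [(1, True, 0.0, 1.0), (2, False, 1.0, 2.0), (3, True, 1.0, 1.0)]
```

Note that the control here contributes to the risk-set of the first case only because it happened to be in follow-up at that time, not because it was matched to that case.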

This approach has been used to analyze secondary events observed other than the primary for cases in the design (

In

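Then, by denoting $w_{ij}$ as the inverse of the selection probability {i.e., $w_{ij} = 1/p_{ij}$} for the $j$th subject in the $i$th set, the standard conditional likelihood [1] can be weighted as

$$L_w(\beta) = \prod_{i} \frac{w_{i0} \exp\{\beta X_{i0}(t_i)\}}{\sum_{j \in R_i} w_{ij} \exp\{\beta X_{ij}(t_i)\}}$$

where $R_i$ stays the same as in [1].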
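A minimal Python sketch of weighted conditional logistic regression for one covariate is given below. It keeps the matched sets intact and only reweights the members within each set. The data layout, the function names, and the use of a golden-section search are illustrative assumptions; they are not the estimation routine used for the TEDDY analysis.

```python
import math

def weighted_conditional_loglik(beta, matched_sets):
    """Each set is a list of (x, weight) pairs with the case listed first."""
    total = 0.0
    for members in matched_sets:
        case_x, case_w = members[0]
        denom = sum(w * math.exp(beta * x) for x, w in members)
        total += math.log(case_w) + beta * case_x - math.log(denom)
    return total

def fit_beta(matched_sets, lo=-5.0, hi=5.0, tol=1e-8):
    """Maximize the concave log-likelihood by golden-section search."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if (weighted_conditional_loglik(c, matched_sets)
                < weighted_conditional_loglik(d, matched_sets)):
            a = c  # maximum lies to the right of c
        else:
            b = d  # maximum lies to the left of d
    return (a + b) / 2

# Hypothetical 1-to-1 sets: the control in the first set has weight 4
weighted_sets = [[(0.0, 1.0), (1.0, 4.0)], [(1.0, 1.0), (0.0, 1.0)]]
unweighted_sets = [[(0.0, 1.0), (1.0, 1.0)], [(1.0, 1.0), (0.0, 1.0)]]
beta_weighted = fit_beta(weighted_sets)
beta_unweighted = fit_beta(unweighted_sets)
```

In this toy example the unweighted estimate is zero by symmetry, while up-weighting one control shifts the estimate away from zero, which illustrates how the weights change the estimate without breaking the matching.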

Since the full cohort from which the NCC design participants are selected is available, the full cohort data can be used to estimate the selection probability

This inverse selection probability weighting approach is also useful when the characteristics of event free subjects are of interest and are to be estimated from NCC data. When matching factors other than the risk-set are used in an NCC design, the characteristics of the selected controls become similar to those of their cases, rather than to those of the event free participants in the cohort. Hence, the controls included in an NCC design cannot be used directly to make inferences about the event free population regarding the characteristics collected in the design. In this context, the selection bias corrected approach can also support such inferences.
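As a small illustration of this use, the Python sketch below computes an inverse probability weighted mean of a characteristic among selected controls. The values, weights, and the function name `weighted_mean` are hypothetical.

```python
def weighted_mean(values, weights):
    """Mean of values, each weighted by its inverse selection probability."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical controls: the second was less likely to be selected,
# so it stands in for more event free participants in the cohort.
concentrations = [50.0, 60.0]
weights = [1.0, 3.0]
estimate = weighted_mean(concentrations, weights)
```

With equal weights the function reduces to the ordinary sample mean of the selected controls.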

TEDDY is a prospective cohort study across six participating clinical centers: the Pacific Northwest Diabetes Research Institute, Seattle, Washington; the Barbara Davis Center, Denver, Colorado; a combined Georgia/Florida site at the Medical College of Georgia, Augusta, Georgia and the University of Florida, Gainesville, Florida; the University of Turku (Turku, Oulu and Tampere, Finland); Lund University, Malmö, Sweden; and the Diabetes Research Institute, Munich, Germany (

In order to perform analyses across various biomarkers, TEDDY set up two NCC designs: one for islet autoimmunity (IA, the pre-diabetic endpoint) and the other for type 1 diabetes (T1D). At the close of the cohort for the NCC design (i.e., sampling time), the median follow-up age was 40 months (first quartile = 25, third quartile = 60). Additional matching factors were having a first-degree relative with T1D (T1D family history), sex, and the clinical center located in the region where the participant was enrolled. TEDDY selected controls based on their sample availability among the six potential controls randomly selected from each risk-set (

TEDDY recently investigated whether plasma 25(OH)D concentration (nmol/L) throughout childhood is associated with development of IA in the 1 to 3 TEDDY NCC design (

Since cases are also potential controls until they develop the event of interest, the population for event free subjects (i.e., Y=0) includes cases by their event time, as well as event free subjects by their censored time at the time of design. A logistic regression model was used to estimate the selection probability for being included as a control in the NCC design for IA. We considered the factors related to retention in TEDDY as proxy variables. Previously, TEDDY identified such factors as country where the participant was enrolled, sex, illness experienced during the first year, maternal age, father’s study participation, maternal lifestyle behaviors, and accuracy of the mother’s risk perception (
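The selection probability model can be sketched in Python as a logistic regression fitted by Newton-Raphson. For brevity this example uses an intercept and a single binary covariate standing in for the retention-related factors; the data, the covariate, and the function names are hypothetical.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Fit logit P(selected | x) = a + b * x by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            v = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += v               # information matrix entries
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def inverse_probability_weights(xs, a, b):
    """Return 1 / P(selected | x) for each subject."""
    return [1.0 + math.exp(-(a + b * x)) for x in xs]

# Hypothetical event free subjects: x is a binary factor, y = 1 if selected
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [1, 0, 0, 0, 1, 1, 1, 0]
a_hat, b_hat = fit_logistic(xs, ys)
weights = inverse_probability_weights(xs, a_hat, b_hat)
```

Subjects from a group that was rarely selected receive a large weight, so each selected control from that group represents more of the event free population.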

For the selection bias corrected analyses, the inverse of the selection probability estimate was applied as a weight for the regression parameter estimation. Taking into account the variability of the selection probability estimation, the jackknife variance was calculated and an approximation of the 95% confidence interval was obtained. Without weighting, the standard likelihood analysis was applied to obtain the regression parameter estimate and 95% confidence interval.
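The jackknife step can be sketched in Python as below. The text does not state which unit is deleted in each replicate, so this sketch assumes a leave-one-unit-out jackknife over a generic list of units (for example, matched sets) and a normal approximation for the interval; `jackknife_variance`, `normal_ci`, and the toy estimator are hypothetical names. To reflect the variability of the selection probability estimation, the supplied estimator would need to re-estimate the weights within each replicate.

```python
import math

def jackknife_variance(units, estimator):
    """Leave-one-unit-out jackknife variance of estimator(units)."""
    n = len(units)
    replicates = [estimator(units[:i] + units[i + 1:]) for i in range(n)]
    mean_replicate = sum(replicates) / n
    return (n - 1) / n * sum((r - mean_replicate) ** 2 for r in replicates)

def normal_ci(estimate, variance, z=1.96):
    """Approximate 95% confidence interval from a variance estimate."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

def sample_mean(values):
    return sum(values) / len(values)

# Toy check with the sample mean as the estimator
data = [1.0, 2.0, 3.0, 4.0, 5.0]
variance = jackknife_variance(data, sample_mean)
lower, upper = normal_ci(sample_mean(data), variance)
```

For the sample mean, the jackknife variance equals the usual sample variance divided by the number of units, which gives a quick sanity check of the implementation.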

For illustrative purposes, Cox regression was applied after adjusting for those additional matching factors, in order to examine the association between childhood 25(OH)D concentration and IA. The average of 25(OH)D was analyzed as a time varying covariate by calculating it in each risk-set. Without weighting, this analysis ignores the NCC design and is biased, since the subjects in the NCC design are handled as if they were the full TEDDY cohort. As shown in

We also summarized the childhood 25(OH)D concentration by the case-control status (

Based on the TEDDY data, a simulation study was conducted to assess the bias when a non-random control selection was ignored in an NCC design. The controls selected were determined by the 1 to 3 TEDDY NCC design. The prevalence model for IA given a covariate ^{a}) = −3.1533 + ^{a})

in the TEDDY cohort. When ^{a} denotes the matching factors other than the risk-set, ^{a}) = −0.0365 *

Based on the invariance property of the odds ratio, we assumed the covariate model for ^{a}) = ^{a}) +

Assuming the control selection from event free subjects in the cohort also depended on ^{a}, the selection model can be written as ^{a},^{a}) + ^{a}) is a linear function of ^{a} and α is the selection parameter for the dependency between ^{a},^{a}) + α *

Using [

Based on the randomly generated ^{a}); and (

Without the correction, the estimate of

NCC studies are particularly advantageous for longitudinal biomarker studies as they can reduce the high cost and labor associated with collecting complete data in prospective cohort studies. The choice of this design for biomarker studies is growing, not only because it requires only a small selection of non-cases, but also because the design can be used with greater flexibility to match on longitudinal variables such as sample availability/compliance. As NCC studies become more popular and more flexibly designed, it will be vital that the chosen statistical tool fully respects the way the study was constructed, in order to produce valid findings.

A key aspect of an NCC design is the selection of a control to pair with a case at a specific time based on the case’s event. The control is selected among the event free subjects at the specific time unique to each case (i.e., the risk-set matching). The chance of selection must be independent of when the subjects drop out of the study or later become a case themselves in the full cohort (i.e., between risk-set independence). In practice, there is often a desire to avoid selecting any controls that eventually become cases in the closed cohort at the time of the design. However, this modifies the risk-sets and violates the between risk-set independence. The design then becomes neither an NCC design nor a case-control design, and no standard statistical method for either design will produce valid analyses. If the implementation of an NCC design maintains the between risk-set independence, the analytical tool should be one of the methods that condition on the matching. When the matching is ignored (i.e., broken), no statistical modeling will be sufficient to remove the bias, given the complexity of longitudinal matching nested within the subject-level matching. For this reason, breaking the matching should be the last choice in NCC data analysis.

In this paper, we considered the situation in which controls are selectively chosen within a risk-set, in order to avoid selecting controls without the data necessary for the implementation of an NCC design. We proposed inverse probability weighting within the matching strata and analyzed the NCC data with weighted conditional logistic regression. Although weighted Cox regression has been available for non-random NCC designs, this technique requires the matching to be broken and considers those included in the design as a sub-cohort. Such an application undermines the reason for choosing an NCC design in the first place. In order to estimate the selection probability of controls, we used a logistic regression model with the factors related to dropout and compliance.

We illustrated our approach using the TEDDY data analysis. However, the TEDDY NCC design was not completely selective, since six potential controls were first randomly selected, from which three were selected based on availability of samples. Therefore, the difference we presented between the conditional logistic regression analyses with and without weighting may be smaller than it would be if the design were completely selective. In our simulation study, we kept the TEDDY case-control status and considered two types of selection probability estimation, with and without proxy variables for the risk-set matching. We showed the bias in the analysis without weighting and the bias reduction in weighted conditional logistic regression with both types of weighting. The weighting that considered the factors for the risk-set matching performed better in general but still failed to remove the bias completely. This is likely because the simulated biases did not reflect the risk-set matching when the TEDDY control status was used. Nevertheless, performance may be improved with better estimates of the selection probability in a future study.

A complete list of the members of the TEDDY Study Group can be found in the

The TEDDY Study is funded by U01 DK63829, U01 DK63861, U01 DK63821, U01 DK63865, U01 DK63863, U01 DK63836, U01 DK63790, UC4 DK63829, UC4 DK63861, UC4 DK63821, UC4 DK63865, UC4 DK63863, UC4 DK63836, UC4 DK95300, UC4 DK100238, UC4 DK106955, UC4 DK112243, UC4 DK117483, and Contract No. HHSN267200700014C from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), Centers for Disease Control and Prevention (CDC), and JDRF. This work was supported in part by the NIH/NCATS Clinical and Translational Science Awards to the University of Florida (UL1 TR000064) and the University of Colorado (UL1 TR001082).

CONFLICT OF INTEREST

No other potential conflicts of interest relevant to this article were reported.

Hypothetical example to show a prospective cohort study with 5 participants followed

Hypothetical example to show possible variability in the number of longitudinal data between matched sets in a nested case-control design

Estimates of selection model by logistic regression from the TEDDY full cohort

| Factor | Level | Odds ratio | 95% confidence interval, lower | 95% confidence interval, upper |
|---|---|---|---|---|
| Observed age (Months) | | 1.027 | 1.024 | 1.030 |
| Clinical center | Colorado | 0.780 | 0.637 | 0.954 |
| | Georgia | 0.597 | 0.462 | 0.771 |
| | Washington | 0.576 | 0.457 | 0.725 |
| | Finland | 1.139 | 0.966 | 1.344 |
| | Germany | 0.928 | 0.714 | 1.207 |
| | Sweden | 1 | | |
| Sex | Girls | 0.758 | 0.668 | 0.861 |
| | Boys | 1 | | |
| T1D family history | Yes | 3.320 | 2.793 | 3.946 |
| | No | 1 | | |
| Father’s participation | Yes | 1.855 | 1.166 | 2.952 |
| | No | 1 | | |
| Maternal age (Years) | | 1.018 | 1.005 | 1.031 |
| Number of illnesses in the first year | | 1.016 | 0.999 | 1.033 |

Association between childhood 25(OH)D concentration (average by event time, nmol/L) and Islet Autoimmunity (IA) in the TEDDY 25(OH)D analysis

| Approach | Selection bias correction | Likelihood | Regression parameter estimate | 95% confidence interval, lower | 95% confidence interval, upper |
|---|---|---|---|---|---|
| Keeping the matching^{1} | Without | Conditional^{3} | −0.015 | −0.023 | −0.007 |
| | With | Weighted conditional^{4} | −0.020 | −0.033 | −0.007 |
| Breaking the matching^{2} | Without | Partial^{3} | −0.011 | −0.019 | −0.003 |
| | With | Weighted partial^{4} | −0.013 | −0.026 | 0.0004 |

^{1} Conditional logistic regression was used. Childhood 25(OH)D concentration was calculated with the measures by the case’s age of IA for each matched set.

^{2} Cox regression adjusted for clinical center, sex and T1D family history was used. Childhood 25(OH)D concentration was calculated at each risk-set to be analyzed as a time dependent covariate.

^{3} Likelihood variance estimation

^{4} Jackknife variance estimation

The mean 25(OH)D concentration (nmol/L) at the status of IA free in the TEDDY 25(OH)D analysis

| | Selection bias correction | Characteristics of | N | Mean (Standard deviation) |
|---|---|---|---|---|
| Cases | | | 376 | 51.33 (16.82) |
| Controls (Keeping the matching) | Without | Controls | 1041 | 54.63 (16.77) |
| | With | Event free subjects in the cohort | 1041 | 55.04 (17.21) |
| Event free subjects (Breaking the matching) | Without | Selective event free subjects at the time of the design | 999 | 54.83 (16.74) |
| | With | Event free subjects at the time of the design, by the cases’ event time | 999 | 55.11 (17.24) |

Simulation results from 100 replications: Relative bias (Empirical standard deviation)

| True effect size | Selection parameter | Conditional logistic regression without selection bias correction | Conditional logistic regression with selection bias correction: selection probability estimation on the matching factors other than risk-set | Conditional logistic regression with selection bias correction: selection probability estimation on the matching factors other than risk-set + TEDDY compliance factors including the observed age |
|---|---|---|---|---|
| −2.0 | −1.25 | 0.972 (0.065) | −0.200 (0.054) | −0.174 (0.065) |
| | −0.75 | 0.592 (0.069) | −0.360 (0.071) | −0.351 (0.079) |
| −1.5 | −1.25 | 0.984 (0.061) | −0.075 (0.048) | −0.038 (0.063) |
| | −0.75 | 0.596 (0.059) | −0.224 (0.058) | −0.213 (0.067) |
| −1.0 | −1.25 | 0.995 (0.061) | 0.083 (0.055) | 0.135 (0.071) |
| | −0.75 | 0.602 (0.056) | −0.080 (0.050) | −0.059 (0.062) |
| −0.02 | −1.25 | 1.012 (0.070) | 0.663 (0.150) | 0.726 (0.148) |
| | −0.75 | 0.608 (0.061) | 0.295 (0.074) | 0.342 (0.084) |