Individual participant data (IPD) meta-analysis is a form of meta-analysis in which
the individual-level data for each study are obtained and used for synthesis. A
common challenge in IPD meta-analysis is that variables of interest are measured
differently in different studies. The term

In response to limitations imposed by traditional meta-analysis, an increasingly
popular approach for data synthesis is individual participant data (IPD) meta-analysis
in which the raw individual level data for each study are obtained and used for
synthesis (

Another advantage of IPD meta-analysis is increased frequencies of low base-rate
behaviors such as suicide or drug use. The frequency of these behaviors may be too low
to be modeled in any single study, but may be high enough when aggregated across
multiple studies. When multiple longitudinal studies are combined, a much broader
developmental period can be considered, given overlapping age ranges across the set of
contributing studies (

IPD meta-analysis has its challenges. In particular, variables of interest are
often measured differently in different studies. The term

There are a number of existing methods for data harmonization which make use of
the fact that even if different studies use different outcomes, they are attempting to
measure the same construct or constructs of interest. One approach is to treat the
unobserved measures as missing data and then replace them with plausible values using
multiple imputation (

With multiple imputation, missing values are replaced with two or more plausible
values to create two or more completed data sets. Analyses are then conducted separately
on each data set and final estimates are obtained by combining the results from each of
the imputed data sets using rules that account for within-imputation and
between-imputation variability. See

In the context of harmonization for IPD meta-analysis, multiple imputation has a number of advantages. Once unmeasured variables have been imputed, analyses and their subsequent inferences are based on existing scales of interest. In addition, after the data set has been filled in, it can be shared with other investigators and can be used for numerous analyses using complete data methods. In fact, once a variable has been multiply imputed, it may be used as an outcome in one analysis, and as a covariate in another analysis.
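The combining step described above can be illustrated with a short sketch. It assumes the standard combining rules usually attributed to Rubin, in which the total variance is the average within-imputation variance plus an inflated between-imputation variance. The function name `pool_estimates` and the input values are hypothetical and are not taken from the trials.

```python
from statistics import mean, variance

def pool_estimates(estimates, variances):
    """Combine results from m imputed data sets (a sketch of Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)           # average of the m point estimates
    within = mean(variances)           # within-imputation variance
    between = variance(estimates)      # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5        # pooled estimate and its standard error

# Hypothetical estimates and squared standard errors from m = 3 imputed data sets.
est, se = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The standard error grows with the spread of the estimates across imputed data sets, which is how uncertainty about the missing values enters the final inference.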

In this article we describe a multiple imputation approach for harmonizing
depression measures across 19 longitudinal intervention trials where there is no single
outcome measure used by all 19 trials. We use the methods of

This article is organized as follows. In Section 2, we describe the example that motivated this work, a study of 19 randomized trials for the prevention of depression among adolescents. In Section 3, we describe our imputation model and diagnostics for checking the quality of imputations when variables are missing for all participants within a study. Section 4 presents the results of applying our methods to the adolescent data and Section 5 offers discussion and areas for future work.

Our motivating example is an ongoing IPD meta-analysis investigating
moderators of treatment effectiveness for the prevention of depression among
adolescents. The project consists of individual participant data from 19 adolescent
depression prevention trials. In 9 of the 19 trials, the intervention was intended
to specifically target youth depression. The remaining 10 trials tested
family-based interventions for behavioral health promotion and for the prevention
of substance abuse and HIV/AIDS sexual risk behavior. Each trial was
an RCT with both an intervention and a control group. More details regarding this
project are described in the accompanying article by

The last column in

Roughly speaking, the 19 trials can be placed into three categories based on
which depression measures they used: 1) those trials that use the Children’s
Depression Inventory (CDI) (

Our approach for harmonizing the depression data across the multiple trials
follows that of

We begin by assembling the data in a vertical (long) format, so that each row represents a single participant at a single point in time. Columns are time, demographics, and the 10 different depression measures used across all trials. To account for skewness in outcomes and non-linear trends over time, all depression measures were transformed using a square root transformation. Imputations were also performed on the original scale for comparison. Time, measured as the number of months since baseline, was log transformed. Once imputation is complete, all depression measures are back-transformed to their original scale.
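A minimal sketch of this data preparation is given below. The records, the field names, and the use of log(months + 1) for time are assumptions made for illustration; the text states only that time was log transformed, and adding 1 is one way to keep baseline (0 months) defined.

```python
import math

# Hypothetical long-format records: one row per participant per assessment.
rows = [
    {"id": 1, "months": 0, "cdi": 9.0},
    {"id": 1, "months": 6, "cdi": 4.0},
    {"id": 2, "months": 0, "cdi": 16.0},
]

def to_model_scale(row):
    """Square-root the score and log-transform time before imputation."""
    return {
        "id": row["id"],
        # Assumption: log(months + 1), so that baseline maps to 0.
        "log_time": math.log(row["months"] + 1),
        "sqrt_cdi": math.sqrt(row["cdi"]),
    }

def to_original_scale(sqrt_score):
    """Back-transform a score from the square-root scale by squaring."""
    return sqrt_score ** 2

transformed = [to_model_scale(r) for r in rows]
restored = [to_original_scale(t["sqrt_cdi"]) for t in transformed]
```

Squaring recovers the original scores here because they are non-negative. Imputed values that fall below zero on the square-root scale would need separate handling, which this sketch does not address.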

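Our imputation model is a multivariate linear mixed-effects regression model. Let $y_i$ denote the $n_i \times r$ matrix of responses for participant $i$, where each row of $y_i$ is a joint realization of the depression measures $Y_1, \ldots, Y_r$ at a single assessment. The model is

$$y_i = X_i \beta + Z_i b_i + \varepsilon_i,$$

where $X_i$ and $Z_i$ are the design matrices for the fixed and random effects of participant $i$, $\beta$ contains the fixed-effects coefficients, $b_i$ contains the participant-level random effects, and $\varepsilon_i$ is the residual error.

Imputations of the missing components of $y_i$ are drawn from the posterior predictive distribution $P(Y_{mis} \mid Y_{obs})$.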

Imputations were performed separately by treatment group so that all of
the parameters in

Fitting the parameters of the model in

Similarly, the RBPC is either measured by itself (Trial 3) or with the
YSR. In fact, in Trial 3 (Familias Unidas 1), the RBPC is the

Looking at

Besides the large number of NA’s which indicate that the
correlation was not estimable because there were no trials which used both
measures on the same participants, what is notable about

Each depression measure is imputed based on a regression which conditions on the remaining depression measures. Thus, when imputing the CDI, not only do we need to be able to measure the pairwise association of the CDI with the other depression measures, we must also be able to measure the associations among the other depression measures. This second condition is slightly problematic because, as mentioned before, not all the measures overlap with each other. The CBCL-A and the CBCL-W do not overlap with the CESD. For this reason, we also remove the CBCL-A and the CBCL-W from our measures to be imputed. The YSR-A and the YSR-W also do not overlap with the CESD, but we cannot remove both of these measures because they are the only measures used by Trials 6, 14, 78, and 98. However, the partial correlation of the CDI and the YSR-W, controlling for the YSR-A, is only 0.07 at baseline. Therefore, without much loss of information, we can also remove the YSR-W from our imputation model.
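The partial correlation used in this argument can be computed from three pairwise correlations. The sketch below applies the standard first-order formula to the pooled baseline correlations reported in the correlation matrix of all depression measures (CDI with YSR-W 0.43, CDI with YSR-A 0.56, YSR-A with YSR-W 0.67). Those pooled values come from different sets of trials, so the result is only an approximation to the 0.07 quoted above, which was presumably computed from participant-level data.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# x = CDI, y = YSR-W, z = YSR-A; pooled baseline correlations from the article's table.
r = partial_correlation(r_xy=0.43, r_xz=0.56, r_yz=0.67)
```

With these inputs the partial correlation is roughly 0.09, small enough to support the same conclusion: once the YSR-A is in the model, the YSR-W adds little information about the CDI.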

In our setting, where the amount of missing data is considerable, and
where we are imputing values for every participant within a trial for some
depression measures, it is particularly important to check the imputation model
and the quality of its imputations. Here, we use posterior predictive checks
using numerical summaries based on test statistics (

Our approach follows the posterior predictive checking and re-imputation
strategy of

Next we generated imputations using the imputation model described
above. Let ^{imp}^{imp}, θ^{imp}, θ

A small

We begin this section by presenting the results of the diagnostics to ensure that our imputations are reasonable and are replicating important relationships relevant to our target analyses. We then analyze the adolescent data using the CDI as our depression outcome of interest. First we only analyze those eight trials which used the CDI. We then analyze all 18 trials using both observed CDI and imputed CDI data.

The first step in evaluating our imputation model is checking the
convergence of the Markov chain used to generate the imputations. We assessed
convergence of our Markov chains by visual inspection of trace plots and
autocorrelation plots. These diagnostics made it very apparent that we were not
able to estimate all of the parameters in our imputation model. For many of the
measures in $y_i$

MCMC diagnostics also suggested that our model did not have adequate data to estimate both random intercept and random slope terms for all measures, even with our block diagonal covariance structure. Thus, only random intercept terms were included in our imputation model. This has the result of assuming measure variances are constant over time.
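As an illustration of what an autocorrelation plot summarizes, the following sketch computes the lag-k sample autocorrelation of a chain of parameter draws. The function name and the toy chain are hypothetical; in practice the input would be the draws of a single parameter from the Markov chain.

```python
def autocorrelation(chain, lag):
    """Lag-k sample autocorrelation of a sequence of MCMC draws."""
    n = len(chain)
    mu = sum(chain) / n
    c0 = sum((x - mu) ** 2 for x in chain)
    ck = sum((chain[t] - mu) * (chain[t + lag] - mu) for t in range(n - lag))
    return ck / c0

# Toy chain that trends steadily upward, so neighboring draws are correlated.
toy_chain = [1.0, 2.0, 3.0, 4.0, 5.0]
lag1 = autocorrelation(toy_chain, 1)
```

Autocorrelations that stay high over many lags indicate slow mixing, which is one sign that a parameter is poorly identified by the data.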

We generated 500,000 parameter draws from our reduced imputation model
from a single Markov chain. Trace plots and autocorrelation plots of those
parameters associated with the CDI reflected convergence of the chain. Formal
diagnostics based on the Geweke Diagnostic (

Although the imputed values in the middle panel of

Despite the findings from the imputation diagnostics, which suggested
that our imputation model is not preserving important features of the data, we
proceeded to discard the duplicated data and analyze the data from the 18
adolescent trials. We analyzed the CDI scores (both observed and imputed) as a
function of treatment and time using the following random intercept and slope
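regression model. Let $\mathrm{CDI}_{ijk}$ denote the CDI score of participant $i$ in trial $k$ at assessment $j$, let $\mathrm{Tx}_i$ indicate whether participant $i$ was assigned to the intervention group, and let $\mathrm{time}_{ijk}$ denote the corresponding transformed time since baseline. The model is

$$\mathrm{CDI}_{ijk} = \beta_0 + \beta_1\,\mathrm{time}_{ijk} + \beta_2\,(\mathrm{Tx}_i \times \mathrm{time}_{ijk}) + b_{0k} + b_{0i} + b_{1i}\,\mathrm{time}_{ijk} + \varepsilon_{ijk}.$$

As in our imputation model, time has been log transformed. Here $b_{0k}$ is a trial-level random intercept, $b_{0i}$ and $b_{1i}$ are participant-level random intercept and slope terms, and $\varepsilon_{ijk}$ is the residual error.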

In this model, inference focuses on the regression coefficient
$\beta_2$, the time by treatment interaction.
This term is the difference in slopes between intervention and control groups.
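To make the interpretation of the interaction concrete, the sketch below evaluates the fixed-effects part of such a model for control and intervention participants. The coefficients are the observed-only estimates reported in the results table (intercept 8.53, time −0.43, treatment by time −0.26). The function name and the use of log(months + 1) as the transformed time are assumptions for illustration.

```python
import math

# Fixed-effect estimates from the observed-only analysis in the results table.
INTERCEPT, TIME, TX_BY_TIME = 8.53, -0.43, -0.26

def mean_cdi(months, treated):
    """Population-average CDI score implied by the fixed effects only."""
    # Assumption: transformed time is log(months + 1).
    t = math.log(months + 1)
    return INTERCEPT + TIME * t + TX_BY_TIME * treated * t

# Difference between intervention and control means 12 months after baseline.
gap_12 = mean_cdi(12, 1) - mean_cdi(12, 0)
```

Because the model has no treatment main effect, the two groups share the same expected score at baseline, and the gap between them at any later time equals the interaction coefficient multiplied by transformed time.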

We have described a multiple imputation approach for harmonizing outcomes across multiple longitudinal trials. In our motivating example, we initially sought to harmonize 10 measures across 19 trials. This proved to not be possible using our methodology, because there was not enough overlap across measures to enable us to estimate their joint distribution. We then pursued a more modest goal, dropping one of the trials and attempting to harmonize only the CDI which was already used in 8 of the 18 remaining trials and overlapped with most of the other measures in at least one trial. Based on our imputation diagnostics, this reduced model did not appear to preserve relationships among variables or produce accurate imputations. Performing imputations on the original scale of the outcomes or after square root transformation did not improve our results, nor did rounding or not rounding the imputations. Still, this exercise was informative, because it highlighted those conditions that are necessary for harmonizing measures across multiple trials using multiple imputation. We now summarize each of these conditions.

Our imputation model was a two-level hierarchical model in which repeated
observations were nested within individuals. As a result, clustering at the trial
level was ignored, and observations on different participants within the same
trial were assumed to be independent. See

At first blush, in a data set which contains 19 trials, incorporating
between-trial variability into our imputation model would appear to be feasible.
However, in a setting where a depression measure can be missing for every
participant within a trial, estimating random effects at the trial level for
each depression measure requires enough information to estimate the
correlation between measures at the trial level. This in turn requires that both
measures be used together in 3 or more trials. As can be seen in

Two sources of between-trial variability in our data set are the various interventions used in the different trials and the various patient populations. When it is not possible to incorporate trial-level variability into the imputation model, one option is to restrict the number of trials to a more homogeneous sample with respect to patient population and intervention type. This could potentially remove the need to model trial-level variability, at the expense of addressing a different research question.
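The overlap requirement discussed above can be checked before any model is fit by counting, for each pair of measures, the number of trials that used both. The sketch below does this for the four trials listed in the duplication design table, restricted to the measures kept in the reduced imputation model. The function name and the threshold variable are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Measures per trial, taken from the duplication design table.
trial_measures = {
    1: {"CDI", "CBCL-D"},
    2: {"CDI", "CBCL-D"},
    84: {"CDI", "CBCL-D", "YSR-A"},
    247: {"CDI", "CESD", "CESD10"},
}

def overlap_counts(usage):
    """Count the trials in which each pair of measures was used together."""
    counts = Counter()
    for measures in usage.values():
        for pair in combinations(sorted(measures), 2):
            counts[pair] += 1
    return counts

counts = overlap_counts(trial_measures)

# Pairs observed together in at least 3 trials.
MIN_TRIALS = 3
well_covered = {pair for pair, n in counts.items() if n >= MIN_TRIALS}
```

In this subset only the CDI and CBCL-D pair reaches three trials, which illustrates how quickly the information available for trial-level correlations runs out.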

Again due to the sparsity in our data, we were unable to estimate random
slope effects at either the trial or the participant level in our imputation
model. Thus, our models assume that variances of measures are constant over time
and that the correlations between measures are time-invariant. However, looking
at the observed columns in

This last condition seems obvious but it was an issue in our data.
Although all studies sought to measure depression, some studies used self
ratings, others used parent ratings, and some used clinician ratings. Some
subscales sought to measure different components of depression. The result, as
seen in

A useful diagnostic in our setting is the

We sought to harmonize 10 measures across 19 trials and were unable to do so
primarily due to the large amount of missing information, the lack of overlap across
measures, and the low correlations among many of the measures that did overlap. We
pursued harmonization via multiple imputation because, when done correctly, it has
the following advantages: variables remain on their original scale, special
analytical methods are not required after the data have been imputed, relationships
among variables are preserved, and between-trial variability is accounted for. If
the analyst is willing to forgo some of these advantages, other approaches may be
feasible. The simplest approach is to standardize all measures and treat them as if
they were identical on the transformed scale. Standardization can be easily applied
in most situations with continuous measures and does not require specialized
software. However, standardization does not take into account differences in the
measurement properties of different scales and tends to mask heterogeneity between
studies. Interpretation can be difficult because the analysis is no longer on the
original scale (

Latent variable methods which assume a single common factor may be more
feasible in our setting but use sophisticated models and require assumptions
regarding measurement invariance over time that can be hard to check. This is the
approach taken by

A promising approach in our opinion is to bring in additional sources of
information. One way to obtain additional information is by drilling down to the
item level and linking items across measures. When the same items occur in different
measures, an item response theory (IRT) approach can be used (

Another source of (external) information is “bridging
studies” that provide overlap on measures when there is no overlap in the
data set of interest. These bridging studies can be appended to the data set of
interest in order to facilitate harmonization (

Increasingly, researchers are collecting data from multiple studies in order
to synthesize findings and perform more sophisticated analyses. These projects will
continue to grow as federal funding agencies encourage data sharing (

Trial names and trial-level descriptive statistics for the 19 adolescent depression trials consisting of 5547 participants.

Trial ID | Trial Name | N | Percent Female | Age (range) | No. Assessments | Duration (months) | No. Measures |
---|---|---|---|---|---|---|---|
1 | New Beginnings | 240 | 49 | 10 (9–12) | 4 | 9 | 4 |
2 | Family Bereavement | 245 | 47 | 11 (7–16) | 3 | 14 | 4 |
3 | Familias Unidas I | 258 | 53 | 13 (12–16) | 4 | 24 | 1 |
6 | Familias Unidas DJJ | 242 | 36 | 15 (12–17) | 3 | 12 | 3 |
8 | Bridges (Puentes) | 542 | 51 | 12 (11–14) | 3 | 14 | 5 |
10 | ADEPT | 94 | 60 | 15 (13–18) | 4 | 24 | 2 |
12 | Project Alliance 2 | 592 | 48 | 12 (10–14) | 3 | 24 | 2 |
14 | Familias Unidas CDC | 160 | 49 | 15 (14–18) | 4 | 12 | 3 |
17 | Family Talk | 135 | 43 | 12 (9–15) | 3 | 18 | 1 |
18 | CATCH IT | 83 | 57 | 17 (14–21) | 6 | 12 | 2 |
28 | Penn Resiliency Program (PRP) I | 697 | 46 | 12 (9–15) | 6 | 24 | 1 |
49 | Prevention of Depression Study (PODS) | 316 | 59 | 15 (13–17) | 3 | 6 | 3 |
50 | K-IPT AST | 57 | 60 | 15 (13–17) | 6 | 18 | 3 |
61 | Narsad IPT | 41 | 85 | 13 (11–16) | 4 | 6 | 3 |
78 | Familias Unidas II | 213 | 36 | 14 (12–16) | 3 | 15 | 3 |
84 | Project Alliance 1 | 179 | 55 | 12 (11–14) | 3 | 24 | 6 |
98 | Preparing for the Drug Free Years (PDFY) and Iowa Strengthening Families Program (ISP) | 667 | 52 | 11 (10–13) | 3 | 15 | 2 |
247 | IPT-AST vs CBT | 379 | 55 | 14 (13–18) | 5 | 18 | 1 |
698 | Penn Resiliency Program (PRP) II | 407 | 48 | 12 (10–15) | 6 | 24 | 1 |

Baseline means and missing data patterns of the 19 adolescent depression trials. Measures followed by an (S) are self-reported measures. Those followed by a (P) are parent-reported measures. Those followed by a (C) are clinician-rated.

Trial ID | CDI(S) | CBCL-A(P) | CBCL-W(P) | CBCL-D(P) | YSR-A(S) | YSR-W(S) | RBPC(P) | CESD(S) | CESD10(S) | CDRS(C) |
---|---|---|---|---|---|---|---|---|---|---|

1 | 5.81 | 5.30 | 2.69 | 4.33 | ||||||

2 | 9.74 | 4.74 | 2.95 | 4.68 | ||||||

3 | 3.29 | |||||||||

6 | 4.70 | 3.82 | 5.68 | |||||||

8 | 3.82 | 2.95 | 3.33 | 5.73 | 3.93 | |||||

10 | 7.76 | 24.45 | ||||||||

12 | 7.49 | |||||||||

14 | 5.00 | 3.81 | 5.35 | |||||||

17 | 5.68 | |||||||||

18 | 22.38 | 12.19 | ||||||||

28 | 8.78 | |||||||||

49 | 15.66 | 8.89 | 28.81 | |||||||

50 | 26.37 | 14.02 | 28.14 | |||||||

61 | 25.22 | 13.15 | 27.88 | |||||||

78 | 4.96 | 4.24 | 6.55 | |||||||

84 | 9.38 | 2.99 | 2.25 | 3.58 | 5.75 | 3.92 | ||||

98 | 4.49 | |||||||||

247 | 9.93 | 17.20 | 9.62 | |||||||

698 | 10.88 |

CDI: Children’s Depression Inventory (self-reported)

CBCL-A: Child Behavior Checklist, Anxious/Depressed Subscale (parent-reported)

CBCL-W: Child Behavior Checklist, Withdrawn/Depressed Subscale (parent-reported)

CBCL-D: Child Behavior Checklist Depression Scale (parent-reported)

YSR-A: Youth Self-Report, Anxious/Depressed Subscale (self-reported)

YSR-W: Youth Self-Report, Withdrawn/Depressed Subscale (self-reported)

RBPC: Revised Behavior Problem Checklist, Anxiety/Withdrawal Subscale (parent-reported)

CESD: Center for Epidemiological Studies Depression Scale (self-reported)

CESD10: Center for Epidemiological Studies Depression Scale, 10 items only (self-reported)

CDRS: Children’s Depression Rating Scale (clinician-rated)

The CATCH IT Trial only used the full CESD at baseline

Correlation matrix of all depression measures at baseline across the 19 trials. The number in parentheses under the correlation is the number of trials which used the depression measure (diagonal) or the number of trials using both measures (off-diagonal). Measures followed by an (S) are self-reported measures. Those followed by a (P) are parent-reported measures. Those followed by a (C) are clinician-rated.

Measure | CDI(S) | CBCL-A(P) | CBCL-W(P) | CBCL-D(P) | YSR-A(S) | YSR-W(S) | RBPC(P) | CESD(S) | CESD10(S) | CDRS(C) |
---|---|---|---|---|---|---|---|---|---|---|
CDI(S) | 1.00 (8) | 0.21 (3) | 0.27 (3) | 0.28 (3) | 0.56 (1) | 0.43 (1) | NA (0) | 0.81 (1) | 0.77 (1) | NA (0) |
CBCL-A(P) | | 1.00 (4) | 0.53 (4) | 0.77 (4) | 0.18 (2) | 0.13 (2) | NA (0) | NA (0) | NA (0) | NA (0) |
CBCL-W(P) | | | 1.00 (4) | 0.65 (4) | 0.07 (2) | 0.21 (2) | NA (0) | NA (0) | NA (0) | NA (0) |
CBCL-D(P) | | | | 1.00 (5) | 0.16 (2) | 0.17 (2) | NA (0) | −0.03 (1) | NA (0) | NA (0) |
YSR-A(S) | | | | | 1.00 (6) | 0.67 (5) | 0.10 (3) | NA (0) | NA (0) | NA (0) |
YSR-W(S) | | | | | | 1.00 (5) | 0.16 (3) | NA (0) | NA (0) | NA (0) |
RBPC(P) | | | | | | | 1.00 | NA (0) | NA (0) | NA (0) |
CESD(S) | | | | | | | | 1.00 | 0.95 (5) | 0.28 (3) |
CESD10(S) | | | | | | | | | 1.00 (5) | 0.27 (3) |
CDRS(C) | | | | | | | | | | 1.00 (3) |

Correlation matrix of depression measures at baseline. Only the 5 measures and 18 studies in the reduced imputation model are included. The number in parentheses under the correlation is the number of trials which used the depression measure (diagonal) or the number of trials using both measures (off-diagonal). Measures followed by an (S) are self-reported measures. Those followed by a (P) are parent-reported measures.

Measure | CDI(S) | CBCL-D(P) | YSR-A(S) | CESD(S) | CESD10(S) |
---|---|---|---|---|---|
CDI(S) | 1.00 (8) | 0.28 (3) | 0.56 (1) | 0.81 (1) | 0.77 (1) |
CBCL-D(P) | | 1.00 (5) | 0.16 (2) | −0.03 (1) | NA (0) |
YSR-A(S) | | | 1.00 (6) | NA (0) | NA (0) |
CESD(S) | | | | 1.00 (6) | 0.95 (5) |
CESD10(S) | | | | | 1.00 (5) |

Design of the duplication and re-imputation strategy. Measures followed by an (S) are self-reported measures. Those followed by a (P) are parent-reported measures.

ID of Duplicated Trial | Measures deleted in duplicate data set | Measures remaining in duplicate data set |
---|---|---|
1 | CDI(S) | CBCL-D(P) |
2 | CDI(S) | CBCL-D(P) |
84 | CDI(S) | CBCL-D(P), YSR-A(S) |
247 | CDI(S) | CESD(S), CESD10(S) |
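The design in this table can be sketched in a few lines. The code below duplicates the rows of one trial, blanks out the measure to be re-imputed in the copies, and summarizes how a statistic computed from imputed values compares with its observed counterpart. The function names, the duplicate label, and the two-sided tail-area formula are assumptions made for illustration. The article's own definition of the posterior predictive p-value (ppp) may differ, and the imputation step itself is not shown.

```python
def duplicate_and_mask(rows, trial_id, measure):
    """Append copies of one trial's rows with `measure` deleted in the copies."""
    duplicates = []
    for row in rows:
        if row["trial"] == trial_id:
            copy = dict(row)
            copy["trial"] = f"{trial_id}-dup"  # hypothetical label for the duplicate
            copy[measure] = None               # to be filled in by the imputation model
            duplicates.append(copy)
    return rows + duplicates

def two_sided_ppp(observed_stat, imputed_stats):
    """Two-sided tail area of the observed statistic among imputed statistics."""
    m = len(imputed_stats)
    upper = sum(s >= observed_stat for s in imputed_stats) / m
    lower = sum(s <= observed_stat for s in imputed_stats) / m
    return min(1.0, 2 * min(upper, lower))

# Hypothetical records: Trial 1 measured the CDI, Trial 3 did not.
rows = [
    {"trial": 1, "id": 1, "CDI": 5.0, "CBCL-D": 4.0},
    {"trial": 3, "id": 2, "CDI": None, "CBCL-D": None},
]
stacked = duplicate_and_mask(rows, trial_id=1, measure="CDI")
```

A value near zero indicates that the imputed data rarely reproduce the observed statistic, while a value near one indicates close agreement.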

Posterior predictive checks of simple correlations at the first three time points in Trials 1, 2, 84, and 247. Results are based on imputed CDI values and their correlation with an observed measure. The CDI and CESD are self-reported measures. The CBCL-D is parent-reported.

Condition | Trial ID | Observed Measure | Baseline Obs. | Baseline Imp. | Baseline ppp | Time 1 Obs. | Time 1 Imp. | Time 1 ppp | Time 2 Obs. | Time 2 Imp. | Time 2 ppp |
---|---|---|---|---|---|---|---|---|---|---|---|
Control | 1 | CBCL-D | 0.23 | 0.06 | 0.16 | 0.29 | 0.09 | 0.10 | 0.15 | 0.05 | 0.42 |
Control | 2 | CBCL-D | 0.22 | 0.14 | 0.46 | 0.16 | 0.09 | 0.48 | 0.24 | 0.11 | 0.24 |
Control | 84 | CBCL-D | 0.50 | 0.16 | 0.00 | 0.19 | 0.12 | 0.64 | −0.01 | 0.20 | 0.20 |
Control | 247 | CESD | 0.84 | 0.46 | 0.00 | 0.82 | 0.50 | 0.00 | 0.79 | 0.45 | 0.00 |
Treatment | 1 | CBCL-D | 0.40 | 0.09 | 0.00 | 0.16 | 0.03 | 0.06 | 0.19 | 0.04 | 0.02 |
Treatment | 2 | CBCL-D | 0.19 | 0.11 | 0.46 | 0.14 | 0.05 | 0.26 | 0.05 | 0.08 | 0.68 |
Treatment | 84 | CBCL-D | 0.24 | 0.08 | 0.20 | 0.01 | 0.08 | 0.54 | 0.25 | 0.10 | 0.26 |
Treatment | 247 | CESD | 0.78 | 0.38 | 0.00 | 0.66 | 0.39 | 0.00 | 0.75 | 0.45 | 0.00 |

Results from posterior predictive checks of CDI means at the first three time points for control and treatment group participants in Trials 1, 2, 84, and 247.

Condition | Trial ID | Baseline Obs. | Baseline Imp. | Baseline ppp | Time 1 Obs. | Time 1 Imp. | Time 1 ppp | Time 2 Obs. | Time 2 Imp. | Time 2 ppp |
---|---|---|---|---|---|---|---|---|---|---|
Control | 1 | 5.33 | 7.64 | 0.00 | 4.51 | 7.14 | 0.00 | 3.68 | 6.56 | 0.00 |
Control | 2 | 9.57 | 8.77 | 0.32 | 8.32 | 7.52 | 0.36 | 8.02 | 7.01 | 0.24 |
Control | 84 | 9.46 | 9.35 | 0.88 | 9.53 | 8.39 | 0.20 | 11.02 | 7.90 | 0.00 |
Control | 247 | 10.86 | 10.65 | 0.74 | 12.24 | 11.07 | 0.02 | 10.49 | 9.95 | 0.42 |
Treatment | 1 | 6.04 | 7.89 | 0.00 | 4.37 | 6.38 | 0.00 | 3.87 | 5.89 | 0.00 |
Treatment | 2 | 9.87 | 8.49 | 0.02 | 7.98 | 6.73 | 0.04 | 7.30 | 6.29 | 0.06 |
Treatment | 84 | 9.31 | 8.83 | 0.58 | 8.71 | 7.18 | 0.06 | 7.62 | 6.73 | 0.36 |
Treatment | 247 | 9.19 | 9.53 | 0.60 | 9.12 | 9.14 | 0.98 | 9.43 | 9.09 | 0.52 |

Results from posterior predictive checks of intercept, slope, and difference between slopes in Trials 1, 2, 84, and 247.

Trial ID | Control slope Obs. | Control slope Imp. | Control slope ppp | Tx slope Obs. | Tx slope Imp. | Tx slope ppp | Tx effect Obs. | Tx effect Imp. | Tx effect ppp |
---|---|---|---|---|---|---|---|---|---|
1 | −0.95 | −0.56 | 0.16 | −1.05 | −0.93 | 0.54 | −0.10 | −0.37 | 0.38 |
2 | −0.61 | −0.60 | 0.94 | −0.91 | −0.85 | 0.66 | −0.30 | −0.24 | 0.90 |
84 | 0.30 | −0.39 | 0.00 | −0.37 | −0.73 | 0.14 | −0.68 | −0.34 | 0.38 |
247 | 0.15 | −0.23 | 0.00 | −0.19 | −0.47 | 0.06 | −0.34 | −0.24 | 0.68 |

Observed-only and post-imputation analyses of CDI scores. Observed-only analyses are based on the 8 trials that used the CDI (n=2874 participants). Imputed analyses are based on 100 imputations for all missing CDI scores and include the 8 trials that used the CDI and the 10 trials that did not use the CDI (n=5289 participants).

Parameter | Observed: Est | Observed: SE | Observed: t-val | Observed: p-val | Observed and Imputed: Est | Observed and Imputed: SE | Observed and Imputed: t-val | Observed and Imputed: p-val |
---|---|---|---|---|---|---|---|---|
Intercept | 8.53 | 0.70 | 12.18 | <.001 | 9.38 | 0.44 | 21.49 | <.001 |
Time | −0.43 | 0.08 | −5.36 | <.001 | −0.44 | 0.08 | −5.59 | <.001 |
Tx*Time | −0.26 | 0.09 | −2.76 | .006 | −0.29 | 0.09 | −3.3 | .001 |
SD($b_{0k}$) | 1.92 | | | | 1.64 | | | |
SD($b_{0i}$) | 6.43 | | | | 5.95 | | | |
SD($b_{1i}$) | 1.65 | | | | 1.13 | | | |
Corr($b_{0i}$, $b_{1i}$) | −0.47 | | | | −0.37 | | | |
SD($\varepsilon_{ijk}$) | 4.16 | | | | 4.48 | | | |

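SD($b_{0k}$): standard deviation of the trial-level random intercept

SD($b_{0i}$): standard deviation of the participant-level random intercept

SD($b_{1i}$): standard deviation of the participant-level random slope

Corr($b_{0i}$, $b_{1i}$): correlation between the participant-level random intercept and random slope

SD($\varepsilon_{ijk}$): residual standard deviation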