
Integrative Data Analysis (IDA) encompasses a collection of methods for data synthesis that pools participant-level data across multiple studies. Compared to single-study analyses, IDA provides larger sample sizes, better representation of participant characteristics, and often increased statistical power. Many of the methods currently available for IDA have focused on examining developmental changes using longitudinal observational studies employing different measures across time and study. However, IDA can also be useful in synthesizing across multiple randomized clinical trials to improve our understanding of the comprehensive effectiveness of interventions, as well as mediators and moderators of those effects. The pooling of data from randomized clinical trials presents a number of methodological challenges, and we discuss ways to examine potential threats to internal and external validity. Using as an illustration a synthesis of 19 randomized clinical trials on the prevention of adolescent depression, we articulate IDA methods that can be used to minimize threats to internal validity, including (1) heterogeneity in the outcome measures across trials, (2) heterogeneity in the follow-up assessments across trials, (3) heterogeneity in the sample characteristics across trials, (4) heterogeneity in the comparison conditions across trials, and (5) heterogeneity in the impact trajectories. We also demonstrate a technique for minimizing threats to external validity in synthesis analysis that may result from non-availability of some trial datasets. The proposed methods rely heavily on latent variable modeling extensions of the latent growth curve model, as well as missing data procedures. The goal is to provide strategies for researchers considering IDA.

Integrative Data Analysis (IDA;

IDA presents a number of methodological challenges for data preparation and analytic modeling. The first of these is measure harmonization. Even in a study combining trials targeting the same outcome, say depressive symptoms, measurement of outcomes as well as baseline covariates may differ across trials. A second challenge is heterogeneity in the populations represented across trials. Trials generally differ in population characteristics of race/ethnicity, geographic region, community socioeconomic status as well as community cultures and political histories. Prevention trials often differ dramatically in the level of baseline risk; some trials being universal while others are selective or indicated. A third area of challenge includes heterogeneity in study characteristics. These differences include characteristics unique to the protocol such as the timing of assessments or the mode and target of delivery. This also includes the type of intervention and the number of intervention conditions in the trial.

Some of these challenges have been addressed in the literature. For instance, Item Response Theory (IRT) has been put forth as one method for handling measure harmonization. The method relies on bridging items that are common across different measures; it determines the dimensionality of the underlying construct, tests for differential item functioning across subsets of the samples (e.g., gender, age), and finally creates a scale score (

Concerns resulting from heterogeneity across trials have also been addressed in the literature. When IDA is undertaken with a small number of studies, or a group of studies that are not considered representative of an entire population of such studies, an indicator of trial membership can be included as a fixed covariate in fixed effect modeling. With a large number of trials, multilevel or random effects modeling can be used as it is in meta-analysis; see

While these methodologic challenges of IDA have each been addressed individually in the literature, they compound quickly when working with data from multiple, longitudinal, randomized clinical trials where follow-up assessment periods, outcome measures, sample characteristics, and control or intervention groups are likely to differ between trials. Even across trials targeting the same outcome, the intervention approaches may differ significantly in terms of delivery target (e.g., delivered to individuals vs. delivered to groups), theoretical framework (e.g., based in interpersonal therapy vs. cognitive behavioral therapy), and number of intervention arms (e.g., control compared to single intervention, control compared to multiple interventions, comparison of two active interventions). All of these challenges raise questions about the internal validity of the synthesis study. External validity, the concern that the trials included in a synthesis may differ from the universe of available trials, is also an important question that needs to be addressed.

The methodologic aims of this manuscript are to articulate methods that address generic questions of internal and external validity in an IDA synthesis study that pools individual participant data across multiple, longitudinal randomized clinical trials. To address challenges to internal validity, we articulate methods and the underlying assumptions used to handle (1) different outcome measures used in different trials, (2) different follow-up assessments employed by the different trials, (3) differences in sample characteristics across trials, (4) combining results across trials containing different comparison conditions, (5) modeling variation in impact trajectories, relying heavily on extensions to latent growth curve modeling, and (6) assessing selection bias threats to external validity based on non-availability of some trial datasets. We address these methodological aims through the example of

The data from

In this paper, we first provide a brief overview of these 19 trials and their variations in measures, assessment, intervention, and sample characteristics. As many of the challenges faced in analysis can be viewed from a missing data perspective, we discuss this view next. Third, we present six challenges, their potential for affecting internal or external validity, and what analytic strategies we have used to reduce these threats. Finally, we discuss general lessons that are applicable to other synthesis projects and methodologic approaches.

We discuss general challenges and ways to minimize threats to validity for synthesizing findings from an analysis of individual level data from multiple trials; we illustrate solutions based on pooling of individual level data across the 19 prevention trials (

A general issue encountered in synthesizing the effects of interventions across trials is that a behavioral primary outcome is almost never measured with the same instrument across all studies. Interventionists interested in measuring adolescent depression have a wide range of measures to choose from, and this presents a significant challenge in synthesis work. In our study, eight measures of adolescent depression were identified from the trials: four self-report measures (Youth Self Report - Anxiety/Depression, Youth Self Report - Withdrawal/Depression, Center for Epidemiologic Studies - Depression Scale (CESD), and the Children’s Depression Inventory (CDI)), three parent-report measures (Revised Behavior Problem Checklist - Anxiety/Withdrawal, Child Behavior Checklist (CBC) - Anxiety/Depression, and Child Behavior Checklist - Withdrawal/Depression), and one clinician rating (Children’s Depression Rating Scale). Descriptions of the measures are in

Another near-universal feature of trials is that follow-up assessment schedules vary across studies. Across our 19 trials, the available longitudinal data range from six months to fifteen years (see

There are very few examples of exact replication of interventions in behavioral research, at least in part because of a funding and scientific emphasis on innovation over replication. The 19 trials included in our analysis certainly differ in both substance and scope. Although all of the trials were prevention trials, they varied in important ways. Only nine of the 19 trials specifically targeted the prevention of depression. The other ten trials targeted other important outcomes (general mental health, externalizing, substance use, high-risk sexual behavior) and included measures of depressive symptoms as part of their protocol. Eleven trials utilized two intervention arms, seven trials utilized three intervention arms, and one trial utilized four arms. We categorized the active interventions as focusing on cognitive behavioral therapy (CBT), interpersonal therapy (IPT), or parenting skills development. Eleven trials tested active control arms, such as another evidence-based intervention (e.g., IPT, CBT, or the active intervention of interest without a key component such as parent groups), and the remaining eight utilized control arms such as bibliotherapy.

Prevention trials in a synthesis often include widely different populations, ranging from universal to selective to indicated. In our example, the 19 trials differed in terms of their inclusion/exclusion criteria, with some trials targeting adolescents who had been in trouble with authority, others focusing on adolescents who had recently experienced a loss or family change, and others employing a universal approach. These differences in intervention target also resulted in samples that differed significantly not only on important co-morbidities like externalizing, but also on demographic variables, specifically income and parent education (see

In any synthesis of multiple longitudinal trials, missing data can be categorized into five distinct types. The first includes data that were missing within a trial due to attrition or incomplete response, a type of missingness that occurs in virtually all longitudinal trials. For example, a single participant may have missed the scheduled 6-month follow-up interview for his or her trial. Or the proportion of items left unanswered by a subject at a point in time may have exceeded 20% of the items for that measure, the routine cut-off that we used to classify the composite score as missing. The second form is missingness as a result of measure selection. For example, each trial team selected from a wide range of depressive symptom measures; those measures not selected for a particular trial can be considered missing data. Third, the different follow-up schedules in each of the 19 trials can be considered as creating incomplete panel data. Fourth, we can consider each subject in a two-arm trial as having two potentially observable outcomes, one being the set of longitudinal responses if assigned to the active intervention and the other being the set of longitudinal responses if assigned to the control condition. Which of these outcomes is missing depends on the random assignment to condition. Finally, the fifth type of missing data is truncation: we only observe data from the trials whose datasets were shared with us; any other data that could be available from different trials are unknown. Our approach to truncation is very different from that used to handle the other types of missing data; we discuss this last situation under external validity.

Our general analytic approach relies on full information maximum likelihood (FIML) to handle all types of missing data under a missing at random (MAR) assumption. MAR assumes that missingness is unrelated to outcomes once observed data are taken into account. FIML requires this assumption to guarantee unbiased estimates when data on predictor variables are included for participants whose outcomes are missing. FIML conditions on observed covariates and maximizes this conditional likelihood after averaging over any missing data, treating that portion of the likelihood involving missing data as irrelevant, or ignorable, for making inferences. Methods for analyzing missingness of an individual’s datum at a point in time on a particular measure in a single randomized trial (the first of the types of missing data listed above) have been heavily investigated by statisticians. While no analytic method can be expected to produce accurate inferences under all possible missing data mechanisms when there are large amounts of missing data, approaches that use FIML are known to be highly robust in such situations (
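As a minimal illustration of why conditioning on observed covariates matters under MAR, the following sketch (entirely simulated data, not from the trials) contrasts a complete-case mean with a covariate-adjusted estimate when the chance of observing the outcome depends only on an observed baseline variable:

```python
import random
random.seed(7)

n = 20000
x = [random.gauss(0.0, 1.0) for _ in range(n)]        # observed baseline covariate
y = [xi + random.gauss(0.0, 1.0) for xi in x]         # outcome, true mean 0

# MAR: probability of observing y depends only on the observed x
observed = [random.random() < (0.9 if xi < 0 else 0.3) for xi in x]

xs_obs = [xi for xi, o in zip(x, observed) if o]
ys_obs = [yi for yi, o in zip(y, observed) if o]

cc_mean = sum(ys_obs) / len(ys_obs)                   # complete-case mean, biased (~ -0.4)

# Condition on x: regress y on x among observed cases, then average
# the predictions over ALL x values (observed and unobserved alike)
mx = sum(xs_obs) / len(xs_obs)
b = sum((xi - mx) * (yi - cc_mean) for xi, yi in zip(xs_obs, ys_obs)) / \
    sum((xi - mx) ** 2 for xi in xs_obs)
a = cc_mean - b * mx
adj_mean = sum(a + b * xi for xi in x) / n            # recovers the true mean of ~0

print(round(cc_mean, 2), round(adj_mean, 2))
```

The complete-case mean is pulled toward the values of participants who are more likely to be observed, while the covariate-adjusted estimate recovers the population mean, which is the intuition behind likelihood-based approaches that condition on observed data.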

Now consider the second, third, and fourth types of missing data listed above, involving trials having different measures, different follow-up times, and missingness on conditions to which they were not randomly assigned (e.g., active intervention if assigned to control). We argue that for this set of well-conducted trials and each of these situations the data can reasonably be considered to follow mechanisms classified as missing at random (

It is not immediately obvious that we can reasonably ignore missing data mechanisms when the trials have different follow-up designs with different measures. Different trials have different reasons for missingness (e.g., more universal trials may involve less follow-up and more self-report measures), but whatever the reason, it should be identical for intervention and control groups within each trial, as assessment schedules would be the same across conditions and assessors should be blind to condition as well. Thus the second through fourth types of missingness should be ignorable. However, trial then becomes a critical variable to account for in our analysis. Because we know which subjects belong to which trials, trial is an observed covariate for everyone; in analyses that formally allow each trial a distinct pattern of growth, the inclusion of trial as a fixed effect can then lead to appropriate inferences. Including trial as a random effect, rather than a fixed effect, is also not likely to produce much bias in our overall impact analysis of the difference between intervention and control, but such analyses may hide important variation as they involve averaging over the entire set of trials. We also note for completeness that latent variables are missing for everyone and therefore missing at random (

The estimation of these models also requires that the pattern of observed variables across the trials is sufficient to identify all the parameters. For example, if one particular measure is only observed for a single trial there is no information available to assess its correlation with other measures (

A final important note is about trial selection for inclusion into a synthesis study of this kind. As indicated in

Handling heterogeneity in the primary outcomes is the first major task to address in any IDA. In our example, the outcome measures for adolescent internalizing symptoms varied across the 19 trials, and this presented a significant challenge in harmonizing across the studies. We identified eight measures of adolescent depression that were most common across the 19 trials in order to best capture a common construct of internalizing across all trials.

Extensive procedures were used to ensure equivalent coding across trials. Whenever possible we coded data based on original items rather than precomputed constructs in each of the trials. This allowed us to standardize how summary scores were computed across all trials. For example, individual research teams may have differed in how they handled missing items in the computation of a summary score. By working with item-level data, we standardized this across trials so that a summary score was considered missing if fewer than 80% of the items were completed by a participant, a common construct-level decision used by researchers.
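The missing-item rule described above can be sketched as follows; the item responses are hypothetical:

```python
def composite(items, max_missing=0.20):
    """Mean of completed items; score is missing (None) when more than
    20% of items are unanswered (i.e., fewer than 80% completed).

    `items` is a list of item responses, with None marking unanswered items.
    """
    n_missing = sum(v is None for v in items)
    if n_missing > max_missing * len(items):
        return None                        # composite score treated as missing
    answered = [v for v in items if v is not None]
    return sum(answered) / len(answered)

print(composite([2, 1, None, 3, 2]))       # 1 of 5 missing (20%): scored -> 2.0
print(composite([2, None, None, 3, 2]))    # 2 of 5 missing (40%): -> None
```

Applying one such rule to item-level data from every trial removes a source of between-team variation in how summary scores were constructed.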

Variants of measures abound, as instruments are shortened or extended by individual researchers depending on the time available for a survey or their unique interests. By design, some of the trials in our example did not administer the entire set of available items on a particular instrument, or used a related, custom version of an instrument. We treated these shortened and custom measures as “surrogate” measures and incorporated them into our analysis design. In one example, a trial administered a custom set of items closely related to the CDI to 77% of the participants, and the full CDI to the remaining 23%. The overlap of items between this custom measure and the CDI allowed us to regress the summary score for the custom measure on the CDI in our model and thereby infer CDI scores for these participants. In another trial, a shortened version of the CESD was used instead of the full measure, and we used a similar regression approach to include these participants. The regression coefficients for this relationship were calculated based on all the trials having full CESD items, as the short-form scores were computable from the full item set. We fixed the regression coefficients to be the same across time panels and trials, then used full information maximum likelihood to account for these surrogates in the analysis.
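A minimal sketch of the surrogate-calibration idea, with entirely hypothetical score pairs standing in for trials that administered both a short form and the full measure:

```python
# Hypothetical calibration data: (short-form score, full-scale score) pairs
# pooled from trials in which the full measure was administered.
pairs = [(4, 10), (6, 15), (8, 19), (10, 26), (12, 30), (14, 37), (5, 12), (9, 23)]

# Simple least-squares fit of full score on short-form score
n = len(pairs)
mx = sum(s for s, f in pairs) / n
my = sum(f for s, f in pairs) / n
b = sum((s - mx) * (f - my) for s, f in pairs) / sum((s - mx) ** 2 for s, f in pairs)
a = my - b * mx

def impute_full(short_score):
    """Predicted full-scale score from the surrogate measure.

    In the actual model the analogous coefficients were held fixed across
    time panels and trials; here they are fit once from the pooled pairs.
    """
    return a + b * short_score
```

In the study itself this linkage lived inside the latent variable model and was handled by FIML rather than by plugging in predicted scores; the sketch shows only where the fixed regression coefficients come from.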

In any IDA one needs to arrive at a conceptual definition of the primary outcome of interest. In this example, our primary outcome variable was an unobserved latent variable of depressive symptoms identified by the eight measures of depressive and internalizing symptoms represented across the 19 trials. We first estimated this latent variable model using only baseline values to assess fit, and found it moderately good given the large number of parameters involved (CFI = 0.88, RMSEA = 0.018). In the longitudinal model described below, we used the same latent variable structure at each time point and fixed the path coefficients between the observed measures of depressive symptoms and the latent variable to be equal across time. This ensures stability or consistency of the latent construct over time (

The requirement for internal validity is that the factor analysis model posits a single underlying latent construct of depressive symptoms that covers self-reports, parent reports, and clinical reports, and assumes the same measurement error structure across trials and time. As indicated, the model fit for latent depressive symptoms was adequate, indicating general support for the one-dimensional structure. While a formal test of equivalence across all trials and times would likely find some significant variations, we feel reassured by the stability of the loadings that we found across the many different models that we fit. We note that for our analyses in this paper we have relied on depressive symptom data at the scale level. For future follow-up analyses we plan to use the items themselves in a more complex item response analysis (

As times of follow-up will differ across trials, some decisions need to be made regarding which of two approaches should be used to accommodate these variations. One approach is to model data using each individual’s own time points; the second is to cluster similar time points into batches and analyze as a panel study. Both approaches have advantages and disadvantages; the former approach is taken in
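If the panel approach is taken, the batching step might look like the following sketch; the panel times and trial assessment schedules here are illustrative, not the study's actual design:

```python
# Hypothetical common panel times (months since baseline) shared by all trials
PANELS = [0, 3, 6, 12, 18, 24, 36]

def to_panel(month):
    """Assign an assessment, given in months since baseline, to the nearest panel."""
    return min(PANELS, key=lambda p: abs(p - month))

# Two trials with different follow-up schedules (months since baseline)
trial_a = [0, 2, 6, 11, 26]
trial_b = [0, 4, 10, 16, 24]
print([to_panel(m) for m in trial_a])   # [0, 3, 6, 12, 24]
print([to_panel(m) for m in trial_b])   # [0, 3, 12, 18, 24]
```

After batching, each trial contributes data to some subset of the common panels, and panels a trial never assessed are simply treated as missing, consistent with the third type of missingness described earlier.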

The major analytic challenge in this example, as with other IDA of randomized trials, involves estimating latent growth trajectories across time and outcome measures and relating them to overall and specific intervention effects. In our example, we did this by modeling a single latent construct for depressive symptoms for the i^{th} subject in the k^{th} trial (η_{ikt}) at each time point (t). Across the seven assessment periods, we constrained the loadings and intercepts of the measurement model for depressive symptoms to be equal, and allowed the factor variances to change with time, controlling for covariates. Specifically, the j^{th} depression measure assessed at time t for the i^{th} subject in the k^{th} trial, Y_{ijkt}, is an indicator of the underlying latent depression construct η_{ikt}:

Y_{ijkt} = μ_{j} + λ_{j} η_{ikt} + ε_{ijkt}

Here μ_{j} and λ_{j} are the mean and factor loading for the j^{th} measure, and ε_{ijkt} is an error term, assumed to have a normal distribution with zero mean and unique variance σ_{j}^{2}. In these models, trials were treated as fixed effects.

The latent constructs were used as indicators of a latent growth model to estimate change in depression across time. Specifically, we allowed a general transformation of the time axis, f(t), chosen to linearize the effect of the intervention. We allowed our modeling of growth over time, i.e., the “slope,” to capture nonlinear as well as linear patterns across time points. This is especially important in prevention, as preventive effects may diminish or even reverse over time, and such patterns would not be detected if we forced the pattern to be linear. Whereas linearity would be specified in a second-order growth model by setting the loadings on the slope latent variable equal to the time points at which each of the 6 panels was obtained (i.e., 0 for baseline, 6 for the 24-month outcome), nonlinearity was accounted for by allowing these second-level factor loadings of the growth model to be estimated from the data (see
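In equation form, this free-loading ("latent basis") growth specification can be written as follows; the identification constraints shown are a standard choice offered for illustration, not necessarily the study's exact settings:

```latex
\eta_{ikt} = \alpha_{ik} + \beta_{ik}\, f(t) + \zeta_{ikt},
\qquad f(t_1) = 0,\quad f(t_T) \text{ fixed},\quad
f(t) \text{ freely estimated for } t_1 < t < t_T .
```

Fixing the first and last loadings identifies the scale of the slope factor, while the freely estimated interior loadings trace out any nonlinear shape of change over the panels.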

Specifically, our second-order model at the level of the individual specified how these underlying latent depression variables relate to individual-level growth curves defined by their intercepts α_{ik} and slopes β_{ik} on this transformed time scale:

η_{ikt} = α_{ik} + β_{ik} f(t) + ζ_{ikt},

where ζ_{ikt} is a time-specific disturbance term.

Our second-order latent growth model included these two latent variables: baseline internalizing (α_{ik} or Intercept) and a latent variable for linear change in internalizing (β_{ik} or Slope). In our structural equation modeling we controlled for the individual-level demographic variables of age, gender, race/ethnicity, family income, and parent’s educational attainment on both the second-level intercept and slope. We also adjusted the intercept and the slope for trial as a categorical factor. These fixed effects for trial were used instead of two random effects for intercept and slope because the number of trials was too small to estimate their variances and covariances with sufficient precision given the few measures used in each trial. We were also concerned that single random effects may not represent trial-level heterogeneity sufficiently well. Because baseline levels of internalizing may well influence the trajectory of internalizing, we regressed the slope on the intercept to control for baseline internalizing. A test of the intervention effect in each (two-arm) trial was based on the impact on slopes of the indicator of intervention status, Z_{ik} = 0 for control and 1 for intervention, after adjusting for a vector of other covariates x_{ik}:

β_{ik} = θ_{k} Z_{ik} + γ′x_{ik} + ε_{ik},

with trial-specific effects θ_{k} and individual level errors ε_{ik}. The overall effect of the intervention versus control can be obtained using the model

β_{ik} = θ Z_{ik} + γ′x_{ik} + u_{k} + ε_{ik},

where u_{k} represents variation at the trial level, and testing of intervention impact is based on H_{0}: θ = 0 against a two-sided alternative.
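The logic of the slope model can be illustrated with a simplified two-stage analogue (not the paper's one-step SEM): per-person OLS slopes are computed first, and because random assignment is balanced within each trial, trial-level effects cancel out of the intervention-minus-control contrast, which is the role the trial fixed effects play. All data below are simulated:

```python
import random
random.seed(3)

def person_slope(times, ys):
    """Ordinary least-squares slope of one person's repeated measures on time."""
    mt = sum(times) / len(times); my = sum(ys) / len(ys)
    return sum((t - mt) * (y - my) for t, y in zip(times, ys)) / \
           sum((t - mt) ** 2 for t in times)

times = [0, 1, 2, 3]                        # common assessment panels
theta = -0.5                                # true intervention effect on slope
slopes = {0: [], 1: []}
for k in range(4):                          # 4 hypothetical trials
    trial_shift = random.gauss(0, 0.2)      # trial-level difference in drift
    for i in range(500):
        z = i % 2                           # balanced randomization within trial
        b = 1.0 + trial_shift + theta * z + random.gauss(0, 0.3)
        ys = [b * t + random.gauss(0, 0.1) for t in times]
        slopes[z].append(person_slope(times, ys))

# Trial shifts appear equally in both arms, so they cancel in the contrast
est = sum(slopes[1]) / len(slopes[1]) - sum(slopes[0]) / len(slopes[0])
print(round(est, 2))                        # close to theta = -0.5
```

The SEM used in the study estimates everything jointly with FIML, but the two-stage picture conveys why adjusting for trial leaves the overall intervention coefficient θ interpretable.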

A test of moderation of the intervention effect on slope trajectory by baseline level of depressive symptoms was conducted by testing whether the regression coefficients of the latent slope on the latent intercept were different for intervention versus control. The same approach was used to test for other moderator effects at the individual or trial level.

A primary assumption for internal validity in this approach is that variation in change over time within each cluster of follow-up time points is negligible. Since the follow-up assessments that are grouped together are relatively close in time, we are comfortable making this assumption, but recognize that we are unable to fully test its accuracy. When we examine the estimated trajectory over time (see

When attempting to combine effects across diverse trials, an important issue for generalizability is controlling for between-study differences in the samples. Though there was some overlap due to common prevention goals in our example, each of the intervention studies had unique inclusion/exclusion criteria, leading to differences in the presence or level of internalizing or externalizing symptoms at baseline. There were also major differences in socioeconomic status, with some studies having a more disadvantaged sample than others. Race/ethnicity was another source of between-study heterogeneity, with some trials focused exclusively on participants from a particular ethnic or racial group and other studies inclusive of participants from multiple racial/ethnic groups. A final important source of between-trial heterogeneity involved the interventions themselves. Some interventions were based in treatment modalities such as Cognitive Behavioral Therapy or Interpersonal Therapy. Other interventions had strong family components, with a significant amount of time invested with parents of the target adolescent.

Our first global approach to controlling for between-study differences in the sample was to include trial membership as a covariate. For our study of 19 trials, this meant including 18 dummy-coded variables as covariates on the baseline level of internalizing (intercept) and the trajectory of internalizing across time (slope). Such modeling allows for baseline differences in internalizing across trials, which clearly is supported by the data, and potential shifts in the course of symptoms in control groups by trial. The particular choice of dummy coding, i.e., which trial is used as a contrast, has no effect on the coefficient of overall intervention impact that is our primary interest. By using this approach, we acknowledge that this collection of 19 trials is not necessarily a random selection from a broader population of similar studies. Instead, we incorporate trial membership as a fixed effect within the analytic model which effectively removes variability due to differences between trials (
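The fixed-effect coding described above amounts to the following; the trial labels are hypothetical:

```python
def trial_dummies(trial_ids, reference):
    """Build dummy indicator columns for trial membership.

    With 19 trials this yields 18 columns; the `reference` trial is omitted
    and serves as the contrast category.
    """
    levels = sorted(set(trial_ids))
    keep = [t for t in levels if t != reference]
    return {t: [1 if tid == t else 0 for tid in trial_ids] for t in keep}

ids = ["T1", "T2", "T2", "T3", "T1"]       # trial membership per participant
d = trial_dummies(ids, reference="T1")
print(sorted(d))                           # ['T2', 'T3']
print(d["T2"])                             # [0, 1, 1, 0, 0]
```

As noted in the text, which trial is chosen as the reference changes the trial coefficients but not the coefficient for overall intervention impact.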

In the previous section we described our approach to examining an overall impact across all trials. This method is appropriate as long as there is at least one active intervention and a comparison condition in each trial. However, more can be done when there are multiple intervention arms versus a single control, or even when there are intervention arms tested against one another without a control within the same trial, as one would find in a comparative effectiveness trial. We illustrate how we would approach these issues through our specific example. In our example, each of the 19 trials included one or two active intervention arms in the trial to compare against one control condition. We were interested in characteristics of the intervention that may moderate treatment effects, such as type of intervention (CBT, IPT, family-based), recipient of the intervention (child, parent, both conjointly) and whether the trial targeted depression as an outcome. All the active arms in the trial were coded based on these characteristics; all codes were checked for accuracy by the trial principal investigators. This provided us an opportunity to assess whether intervention impact varied by type of intervention or modality.

We attempted several types of analysis, the most general being two-level growth modeling in Mplus (i.e., three-level mixed effects modeling involving time, person, and arm of trial). Unfortunately, these analyses did not converge due to the modest number of trial arms (24) when fit with two correlated random effects for intercept and slope. As an alternative approach, we estimated within-trial intervention effects for each active arm against control using the second-order growth model with slope regressed on each trial (adjusting for intercept and individual-level covariates), and extracted these adjusted empirical Bayes estimates and their standard errors into a separate dataset. Six of the 19 trials had two active intervention arms, each compared against the same control condition, resulting in the estimation of 24 effects of active intervention versus control. In a single trial with two active arms and one control arm, the two empirical Bayes estimates are dependent, requiring that we account for such non-independence when analyzing arm-level covariates. For trials with two active arms we used orthogonal transformations to rotate these effect sizes as well as the covariates, and revised their respective standard errors, yielding uncorrelated estimates while retaining the same multilevel mean and variance-covariance structure. These orthogonal transformations (Cholesky decompositions) were based on the eigenvalue-eigenvector decomposition of the level-one variance-covariance matrix and preserved the total variance at level one as well as the variance across the trials. Details can be obtained from the last author.
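A simplified numeric sketch of decorrelating two effect estimates that share a control arm, using the Cholesky factor of an assumed covariance matrix (the study's actual transformation also preserved the multilevel variance structure; this sketch shows only the whitening step, with hypothetical variances):

```python
import math

# Hypothetical covariance of two arm-vs-control estimates that share one
# control arm: each active arm contributes its own variance, and the common
# control arm contributes a shared variance c to both, inducing correlation.
v1, v2, c = 0.04, 0.05, 0.02
cov = [[v1 + c, c], [c, v2 + c]]

# 2x2 Cholesky factor L such that cov = L L^T
l11 = math.sqrt(cov[0][0])
l21 = cov[1][0] / l11
l22 = math.sqrt(cov[1][1] - l21 ** 2)

def whiten(e):
    """Transform correlated estimates e = (e1, e2) to z = L^{-1} e,
    whose covariance is the identity (uncorrelated, unit variance)."""
    z1 = e[0] / l11
    z2 = (e[1] - l21 * z1) / l22
    return (z1, z2)
```

Any arm-level covariates would be transformed with the same L so that the regression of the whitened effects on the whitened covariates remains interpretable.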

As this project examined variation in impact, we included analytic modeling that allowed subjects in the study to vary both qualitatively and quantitatively in their response to intervention, following our earlier methods work (e.g., modeling a random intervention effect u_{k} and its residual variance). Our main finding was that the beneficial effect of these preventive interventions occurred among those who started with an elevated, but subclinical, level of symptoms (

The primary threat to external validity in a study of this kind is related to the question of selection bias. That is, is the sample of trials for which data are available representative of the full universe of trial data in this substantive area? With respect to external validity, we are interested in understanding how we can relate the findings from these analyses to other prevention trials in the literature that are not included in this synthesis. Additionally, from the perspective of our research questions, we would be less confident in our conclusions if we found the included trials had stronger impact than did other prevention trials whose individual data we did not obtain. Thus, we looked for an existing report on trials that were comparable to ours on a common outcome measure.

To address representativeness of the trials in our synthesis, we also compared the standard deviations in these two sets of studies, finding twice as much spread in the 86 other depression prevention trials compared to ours. This is not surprising, as the other prevention trials contained interventions tested in widely different settings (e.g., schools) and on different ages. Beyond this smaller variance for our trials, we found no other indication that our select group of trials differed from the larger population of trials.

We also extrapolated our growth model findings for our 24 comparisons to the remaining preventive trials by regressing the difference in intervention versus control slopes on the ESs. Such a procedure is commonly used in model-based survey estimation when there is no formal random sampling selection as is true in our study (

The pooling of individual participant data across multiple, longitudinal, randomized trials is rich with methodological challenges that have implications for internal and external validity. Some of these validity issues are shared with meta-analysis. Methodologically, these analyses were difficult to carry out. Not only did trials vary by time of follow-up, but they also used different measures to assess internalizing symptoms. Many of the methodological approaches employed here have been used in isolation rather than in combination (

Within IDA, there are also alternate approaches to the latent variable methods we employed. For instance, missing data, including those data missing as a result of trial design, can be imputed using multiple imputation methods, although such methods have their own limits, as discussed in

As research moves further in the direction of synthesizing participant-level data across trials, the use of common, well-established measures across trials would certainly simplify the analysis problems and provide additional opportunities for assessing goodness of fit. The perspective of using common measures across studies has a time-honored tradition in science, and is one of the major policy changes now under way at the National Institutes of Health. Indeed, the PhenX Toolkit (

Integrative data analysis, whether it is based on a repository of trials or is built through a partnership, is one of three analytic approaches to synthesizing the effects of an intervention across randomized clinical trials. The others include meta-analysis and parallel data analysis (

We have seen the value of IDA in addressing variation in impact through measured as well as unmeasured covariates that interact with intervention. Procedures for sharing of data through partnership have been identified (

There are a number of limitations inherent in this study. Even with the large number of trials in this synthesis, we recognize that there are many other prevention programs that have been tested in trials we did not include. Although we examined effect sizes for trials not captured in this synthesis and found that they reported, on average, more beneficial outcomes, we cannot rule out the possibility that the trials we analyzed differed in other important ways. The inclusion of more trials would certainly enhance the overall findings presented in

We gratefully acknowledge the National Institute of Mental Health Collaborative Data Synthesis for Adolescent Depression Trials Study Team, comprised of our many colleagues who generously provided their data to be used in this study, obtained access to key datasets, reviewed coding decisions, provided substantive or methodologic recommendations, and reviewed this manuscript. We also thank NIMH for their support through Grant Number R01-MH040859 (Collaborative Data Synthesis for Adolescent Depression Trials, Brown PI), and NCATS for their support through Grant Number UL1-TR000460 (Miami Clinical and Translational Science Institute, Szapocznik PI), as well as the following grants that supported the trials: NIMH R01-MH048696 (Beardslee), NIMH R01-MH064717 (Beardslee & Gladstone), NIMH R01-MH064503 (Brent), NIMH R03-MH048118 (Clarke), NIDA R01-DA07031 (Dishion), NIDA R01-DA13773 (Dishion), NIMH R01-MH064735 (Garber), WT Grant Foundation 961730 (Garber), NIMH R01-MH064707 (Gonzales), NIDA R01-DA017462 (Pantin), NIDA R01-DA025894 (Prado), CDC U01-PS000671 (Prado), NIMH P30-MH068685 (Sandler), NIMH R01-MH049155 (Sandler & Ayers), NIMH R01-MH052270 (Seligman), NIDA R01-DA007029 (Spoth), NIDA R01-DA018374 (Stormshak), NIMH MH063042 (Szapocznik), NIMH K08-MH072918 (Van Voorhees), NIMH K23-MH071320 (Young), NARSAD Young Investigator Grant (Young).

The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies nor that of our collaborators who provided access to their data.

Compliance with Ethical Standards

CHB was supported as a consultant on one of these projects (New Beginnings) and received funding support on the Familias Unidas trials, as did Brincks, Huang, and Pantin, who developed and directed much of the research on this intervention. Sandler developed and directed much of the work on two trials in this synthesis.

This article does not contain any studies with animals performed by any of the authors. This project, which involved sharing of data, was reviewed by two IRBs. Use of deidentified data in this synthesis project was approved by the University of Miami and Northwestern IRBs, and all institutions signed data use agreements with Northwestern.

All trials included in this synthesis were approved by their respective institutional review boards. Informed consent was obtained from all individual participants included in these respective studies.

Footnote: Though not pictured, the observed depression measures are indicators of the latent internalizing construct at each time point.

Footnotes:

*Effect size where negative effect represents beneficial intervention.

**These trials did not specifically target depression with their intervention.

Comparison of Methods

| Models for Synthesis of Legacy Trials | Key Elements | Infrastructure to Support Synthesis | Methodology Strengths | Methodology Challenges |
|---|---|---|---|---|
| Meta-analysis | Assemble summary statistics from published reports | Standard methods and tools available (e.g., Cochrane Collaboration) | Methodology well established | Results are limited to findings that have been published or completed; difficult to harmonize findings across different instruments; severely limited to main effect or subgroup analyses; subject to ecological fallacies |
| IDA based on a repository of trials (e.g., National Database for Clinical Trials related to Mental Illness, NDCT) | Structured data system to document and share all available trial data | Federal research mandates to submit all future trials and availability of a shared database | Can conduct mediator and moderator analyses; ability to link individuals across trials; defined structure for data, documentation, and experiments | Potential misinterpretation of data and low quality control when analyses are conducted by individuals not connected to the trial; legacy trials not included |
| IDA through partnership (e.g., Collaborative Data Synthesis for Adolescent Depression Trials; Perrino et al., 2014) | Partnering of trial directors to answer complex synthesis questions | Technology for assembling, harmonizing, and analyzing data from multiple trials | Ability to conduct sophisticated analyses with comparable data across studies | Challenging to obtain individual-level data from all available trials |
| Parallel data analysis | Have individual sites conduct analyses on their own data based on a standard protocol, then combine findings at a central site | Supplemental funding of individual trials for new follow-up data and analysis | Does not require sharing of individual-level data; ability to conduct sophisticated analyses with comparable data across studies | Provide technical assistance for producing comparable analyses that can be combined |