Missing data are often problematic when analyzing complete longitudinal social network data. We review approaches for accommodating missing data when analyzing longitudinal network data with stochastic actor-based models. One common practice is to restrict analyses to participants observed at most or all time points, to achieve model convergence. We propose and evaluate an alternative, more inclusive approach to sub-setting and analyzing longitudinal network data, using data from a school friendship network observed at four waves (

Complete social network studies attempt to attain a census, or at least a relatively complete assessment, of the relationships among a bounded population (

Statistical models for complete network data, such as exponential random graph models/

In an effort to minimize the negative implications of missing network data, social network statisticians are working on new approaches for the treatment of missing data. Developments of model-based approaches to estimate missing network data for single (cross-sectional) observations of networks are underway within the ERGM framework (

However, dealing with missing tie information in longitudinal network models is a more complex problem, which has been explored in a smaller number of studies (e.g.,

Huisman and Steglich, and the RSiena manual (

When researchers are faced with this problem of missing data that compromises the convergence of RSiena models, there are a small number of commonly adopted solutions. The majority of researchers to date appear to restrict their analyses to an analytic sub-sample that is observed (e.g., completes surveys)

This study uses longitudinal data (4 waves over 3 years) on youth alcohol use and school-based friendship networks to illustrate the issues outlined above, and to compare approaches for dealing with missing data and defining analytic samples for longitudinal SABMs, including a new alternative strategy that may be more “inclusive” under certain conditions. Additionally, we validate our findings with a small simulation study. Longitudinal complete social network methods (in particular SABMs) are now commonly employed to investigate how social networks influence risky behaviors in youth (

In this paper, we attempt to fit a longitudinal network model (an SABM) in RSiena to the original data (

Data come from the University of Illinois Bullying and Violence Study (

Six trained research assistants, the primary researcher, and a faculty member collected data. At least two of these individuals administered surveys to classes ranging in size from 10 to 25 students. Students were first informed about the general nature of the investigation. Next, researchers made certain that students were sitting far enough from one another to ensure confidentiality. Students were then given survey packets and the survey was read aloud to them. It took students approximately 40 minutes on average to complete the survey.

The current analyses focus on school friendship networks and alcohol use among middle school students between wave 1 and wave 4 (spring 2008 through fall 2009). We also limit our analyses to one middle school. This resulted in a total possible sample of

Friendships among school-mates were assessed by asking participants to list the first and last names of “the kids at school that you hang out with the most.” Eight response spaces were provided (although the number of friends to nominate was not specified), and participants were instructed not to list siblings or friends who did not go to their middle school. These data were used to generate friendship networks at each wave, represented as a directed, asymmetrical adjacency matrix where 1 = a unilateral friendship between participants

Three items from the

Participants reported on their gender (0 = female, 1 = male), their race and ethnicity, and their grade in school. Parent education level was measured as a categorical variable that represents the highest level of education attained by their mother or father, as reported by the participant. This was transformed into a single dichotomous variable where 1 =

Below we describe the approach used to define the Stringent and Inclusive analytic samples, and the approach to specifying models for each of these samples. However, first we will describe the overall modelling strategy applied in this study: SABMs for longitudinal social network data (

The overall approach of the SABM algorithm is to simulate changes in the network and actor behaviors between multiple (>1) discrete observed panels of data, with the network evolving in continuous time as a Markov process. Unobserved network change is assumed to occur in ministeps, where actors have the opportunity to update (change or not change) one tie or behavior at each step. Model parameters identify processes that motivate actors’ decisions to change their network ties or behavior, and the rate at which they make these changes. Some examples of typical model parameters include structural effects on network change (e.g., tendencies for actors to reciprocate network ties, or to send ties to network members who already have many ties), and effects that involve actor attributes (e.g., the tendency for actors to send ties to network members with similar attributes to themselves). For these analyses, model parameters were estimated using a method of moments procedure (note: we could not employ Maximum Likelihood estimation due to the changing composition of the network), and estimates were deemed significant if the
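The ministep mechanism described above can be illustrated with a minimal sketch. This is not the RSiena implementation: it assumes a toy objective function with only outdegree and reciprocity effects, a fixed number of ministeps in place of a rate function, and arbitrary illustrative parameter values. The names `objective`, `ministep`, and `simulate` are hypothetical.

```python
import math
import random


def objective(adj, ego, beta_density, beta_recip):
    """Toy objective function for one actor: outdegree plus reciprocity."""
    n = len(adj)
    out_ties = sum(adj[ego])
    recip = sum(1 for j in range(n) if adj[ego][j] and adj[j][ego])
    return beta_density * out_ties + beta_recip * recip


def ministep(adj, beta_density, beta_recip, rng):
    """One ministep: a randomly chosen actor changes at most one outgoing tie."""
    n = len(adj)
    ego = rng.randrange(n)
    # Options: toggle the tie to each other actor, or make no change (None).
    options = [j for j in range(n) if j != ego] + [None]
    scores = []
    for j in options:
        if j is not None:
            adj[ego][j] = 1 - adj[ego][j]  # tentatively toggle the tie
        scores.append(objective(adj, ego, beta_density, beta_recip))
        if j is not None:
            adj[ego][j] = 1 - adj[ego][j]  # undo the tentative toggle
    # Multinomial-logit choice among the options (shifted for stability).
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    choice = rng.choices(options, weights=weights, k=1)[0]
    if choice is not None:
        adj[ego][choice] = 1 - adj[ego][choice]
    return ego, choice


def simulate(adj, n_ministeps, beta_density=-1.5, beta_recip=2.0, seed=1):
    """Run a fixed number of ministeps on a copy of the network."""
    rng = random.Random(seed)
    adj = [row[:] for row in adj]
    for _ in range(n_ministeps):
        ministep(adj, beta_density, beta_recip, rng)
    return adj
```

For example, `simulate([[0] * 5 for _ in range(5)], 200)` returns a simulated 5-actor network that started empty. In estimation, statistics of many such simulated networks are compared with the observed data to adjust the parameters.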

Ideally, our data would have minimal missing information about the actors and their ties across the four observed study waves, and we would proceed with defining the analytic sample (and thus longitudinal network) as all participants who completed surveys

The Stringent analytic sample was comprised of participants who completed surveys in 3 of the 4 waves, and thus represents consented students who participated

We applied the standard single-network RSiena model, as described in

In this paper, we propose an alternative approach to defining an “Inclusive” analytic sample for longitudinal network analysis in RSiena. The aim is to retain more consented participants with missing survey data in the analytic sample, and therefore utilize more available data on the friendship patterns and behaviors of students who participated infrequently. To do this we decompose the set of study participants in this school (

The network dynamics of each sub-group network, which span a single time transition (e.g., from wave 1 to wave 2), can then be modeled collectively using a “multi-group” SABM in RSiena. The multi-group option allows us to combine multiple networks (or, in this case, multiple sub-group networks) into one

A basic model was specified that included factors known to predict youth friendships and alcohol use, and effects that are recommended to adequately model dynamics of complex social networks in RSiena. The same model specification (i.e., the same set of effects) was fit to the Stringent and Inclusive samples (a model fit to the Full Sample would not converge). A forward selection approach, described in

Several effects were included as predictors of friendship choices. Associations between alcohol use and friendship choices were tested with three effects: “alcohol use alter” is an effect of peers’ alcohol use on them receiving an actor’s friendship nomination; “alcohol use ego” is an effect of actors’ alcohol use on their outgoing friendship nominations; and “same alcohol use” captures the extent to which friendships were established between peers with matching alcohol use (based on the binary dependent variable, where 1 = any alcohol use). The roles of gender, school grade, race/ethnicity and parent education on friendship choices were included using the same friendship selection effects (covariate ego, covariate alter, same covariate). Finally, endogenous network effects included tendencies for actors to reciprocate friendships (reciprocity), to befriend friends of friends (transitive triplets) and popular peers (indegree popularity), and other structural features that are often important to adequately explain complex network dynamics (3-cycles, indegree activity, outdegree activity; see

Effects included as predictors of alcohol use were actor-level covariates (gender, school grade, race/ethnicity, parent education) and network effects. The latter were two types of effects testing friend influence on actor alcohol use (average similarity and total similarity, with a positive effect indicating that actors’ alcohol intake became similar to the intake of their nominated friend(s)), and an effect of network indegree, with a positive effect indicating that actors with the most friend nominations were likely to adopt or maintain the highest levels of alcohol use. Linear and quadratic shape effects were included to model the overall distribution of scores.
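To make the two influence effects concrete, the sketch below computes average-similarity and total-similarity statistics for each actor. It assumes the usual formulation in which the similarity of two actors is 1 minus their absolute behavior difference divided by the behavior range, centered by the mean similarity. Taking that mean over all ordered pairs of distinct actors in a single wave is a simplifying assumption, and the function name `similarity_effects` is hypothetical.

```python
def similarity_effects(adj, z):
    """Return (average similarity, total similarity) statistics per actor.

    adj[i][j] == 1 means actor i nominates actor j as a friend.
    z[i] is actor i's behavior score (e.g., alcohol use).
    """
    n = len(z)
    z_range = max(z) - min(z)
    if z_range == 0:
        raise ValueError("behavior variable has no variation")

    def sim(i, j):
        # 1 = identical behavior, 0 = maximally different behavior.
        return 1 - abs(z[i] - z[j]) / z_range

    # Centering constant: mean similarity over ordered pairs (an assumption).
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    mean_sim = sum(sim(i, j) for i, j in pairs) / len(pairs)

    average, total = [], []
    for i in range(n):
        friends = [j for j in range(n) if j != i and adj[i][j]]
        tot = sum(sim(i, j) - mean_sim for j in friends)
        total.append(tot)
        # Actors with no nominated friends contribute zero.
        average.append(tot / len(friends) if friends else 0.0)
    return average, total
```

The two statistics differ only in the division by the number of nominated friends. Total similarity therefore gives more weight to actors who nominate many friends, whereas average similarity does not depend on the number of nominations.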

A substantial proportion of participants in the full sample are excluded when defining analytic samples for the longitudinal network models (

SABMs were fit to the Stringent and Inclusive analytic samples and the two models identified the same significant structural effects and covariate effects that predicted friendship dynamics (although the size of the estimates sometimes varied) (

Although the structural and covariate predictors were consistent in the Stringent and Inclusive sample models, the effects of alcohol use on friendship dynamics differed (

In the Stringent sample and Inclusive sample models, there were no significant covariate or network effects found to predict change in alcohol use (

An additional simulation study was conducted to evaluate the “Stringent” and “Inclusive” strategies for longitudinal network data analysis in RSiena. For this study, we used longitudinal data from the “Teenage Friends and Lifestyle study” (

We sought to replicate the analytic approach described earlier in this paper. First, a standard single network SABM was fit to the complete data set of 160 students to establish a “true” model. Next, we simulated data sets where 10% and 25% of randomly selected participants had missing data (10 data sets were generated with each level of missing participant data), meaning that over the 3 waves, 10% (or 25%) of respondents had all of their data (individual attributes and outgoing friendship ties) coded as missing (NA) to replicate a situation of non-response on a survey at a given wave. Next, the Stringent and Inclusive analytic samples were derived from these simulated data sets with missing node-level data: Stringent analytic samples were comprised of participants who were not missing survey data in any of the 3 waves (i.e., participants with no missing survey response in 3 of 3 waves), and the Inclusive analytic samples were comprised of subgroups of participants who were not missing survey data in any 2 consecutive waves (i.e., no missing survey data at Wave 1 and Wave 2, and/or no missing survey data at Wave 2 and Wave 3).
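The sample definitions used in the simulation study can be sketched as follows. This is a minimal illustration, not the code used for the study: it assumes that non-respondents are drawn independently at random at each wave, and the function names are hypothetical.

```python
import random


def simulate_nonresponse(n_actors, n_waves, prop_missing, seed=0):
    """Return observed[i][w]: False if actor i is a non-respondent at wave w.

    Assumption: at each wave, a fixed proportion of actors is drawn at
    random (independently across waves) and treated as missing.
    """
    rng = random.Random(seed)
    n_missing = round(prop_missing * n_actors)
    observed = [[True] * n_waves for _ in range(n_actors)]
    for w in range(n_waves):
        for i in rng.sample(range(n_actors), n_missing):
            observed[i][w] = False
    return observed


def stringent_sample(observed):
    """Actors with survey data at every wave."""
    return [i for i, row in enumerate(observed) if all(row)]


def inclusive_subgroups(observed):
    """One subgroup per transition: actors observed at both of its waves."""
    n_waves = len(observed[0])
    return {
        (w, w + 1): [i for i, row in enumerate(observed) if row[w] and row[w + 1]]
        for w in range(n_waves - 1)
    }
```

For instance, with `observed = simulate_nonresponse(160, 3, 0.25)`, every actor in `stringent_sample(observed)` also appears in each subgroup of `inclusive_subgroups(observed)`, while actors missing only the first or the last wave are retained in one subgroup. Actors missing only the middle wave of three are excluded by both definitions.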

Models were specified using the same approach described earlier in the paper. Predictors of friendship network dynamics included effects of alcohol use (for this continuous variable we included: alcohol use alter, alcohol use alter squared, alcohol use ego, and similar alcohol use), effects of the gender and pocket money covariates (covariate alter, covariate ego, same/similar covariate), and network effects (reciprocity, transitive triplets, 3-cycles, indegree-popularity square root, and outdegree-activity square root). Predictors of alcohol use dynamics included network effects (average similarity), effects of covariates (male, pocket money), and linear and quadratic shape effects (see

Descriptive statistics for the Full samples, Stringent samples, and Inclusive samples are summarized in the

Results of the RSiena models are summarized in the

Efforts to obtain comprehensive and complete longitudinal network data must be prioritized to ensure that the findings and recommendations generated by this research have minimal bias and error. Nonetheless, non-response by participants and natural variation in longitudinal social network studies remain common and present a challenge for modeling data, particularly when networks are unstable and observed at multiple time points over long periods of time. This paper documents two alternative strategies for reducing the proportion of missing data in analytic sub-samples that will help achieve model convergence, but that may differ in the extent to which they bias the sample and results, particularly when data are not missing at random.

In this proof-of-concept study of a school-based adolescent friendship network tracked over three years of middle school, we observe substantial missing data due to partial non-response despite very high initial participation rates (95%). Standard SABMs for longitudinal networks and behavior (

We also evaluated an alternate Inclusive approach to defining and analyzing the friendship network and alcohol use dynamics using multi-group SABM, which was able to retain a substantially greater proportion of the original sample in the analytic sample and identify new model effects. This approach defined the longitudinal friendship network data as multiple distinct networks that are each observed over a single time transition (a transition between any two sequential observations), retaining any participant that provided data over at least two consecutive waves. These networks were then analyzed using one multi-group network model (

Despite the contribution of these findings to the field of longitudinal social network analysis, this study has limitations that need to be recognized. This sampling and analytic strategy was developed and evaluated with a focus on analyzing social network dynamics using SABMs in RSiena, and therefore may not be applicable when using other analytic strategies for longitudinal network analysis (e.g., relational event models, or temporal exponential random graph models). Also, the data that are the main focus of this study were a friendship network drawn from a study in one Midwest city, and so provide a case study of the implications of using different approaches to dealing with missing data for SABM results. The simulation study results provide additional insights into these two approaches, and the implications of applying them to data missing at random, and missing not at random. Nonetheless, it will be important to replicate these findings with additional samples and types of social networks that may have different types and patterns of missing data.

Overall, this research outlines the challenges of missing data when modeling change in longitudinal networks, and documents important insights into approaches that can be adopted when models will not converge due to missing information. These approaches define analytic samples that remove network members with missing data, and have implications for the analytic sample and model outcomes. Our results clearly indicate that it is ideal to obtain data that have as little missing information as possible. However, when missing data compromise model convergence, researchers should carefully compare and select approaches for dealing with missing information. The strategy that we propose for defining and analyzing an “Inclusive” analytic sample is not commonly employed in practice, but may present a useful alternative for researchers: it may help to retain a greater proportion of network members with missing data, and thus may help to increase power and to detect real network effects, and avoid missing effects when data are not missing at random. The usefulness of these alternate approaches is likely to depend on the prevalence and patterns of missing information, so researchers are encouraged to explore multiple strategies to determine how they affect their data and model estimates, as these decisions can influence research findings in potentially profound ways.

Work on this article was supported by grant R01DA033280-01 from the National Institute on Drug Abuse (PI: Harold D. Green). This research uses data from the University of Illinois Bullying and Violence Study (PI: Dorothy Espelage; grant from Centers for Disease Control Grant # 5 U49 CE001268-02). We would also like to thank Tom Snijders and Christian Steglich for their valuable advice on modeling large longitudinal social network data, and for their suggestions which led to the concepts being evaluated in this paper.


ERGM: exponential random graph model

SABM: stochastic actor-based models

Visualization of missing social network data. Nodes represent individuals, and directed lines represent relationships between a pair of individuals. With 80% participation, Node 4 and Node 9 are missing data (individual information, including their outgoing relationships). With 60% participation, Node 1, Node 4, Node 9 and Node 10 are missing data.

Grade-level cohort observations across waves, and their inclusion in time period subgroups (G = grade; N/A = not observed at that wave)

Demographic, alcohol use, and network characteristics of the full, stringent, and inclusive samples

Characteristic | Full Sample | Stringent Sample | Inclusive Sample |
---|---|---|---|
N | 694 | 313 | 427 |
% male | 51 | 50 | 50 |
Race/ethnicity (%) | | | |
American Indian | 2 | 2 | 1 |
African American | 82 | 80 | 81 |
Asian | 0 | 0 | 0 |
Hispanic | 6 | 7 | 6 |
White | 8 | 8 | 8 |
mixed | 1 | 0 | 1 |
other | 2 | 2 | 3 |
% parent with college education | 62 | 61 | 61 |
Any past year alcohol use (%) | | | |
W1 | 32.3 | 25.3 | 25.7 |
W2 | 30.1 | 26.2 | 29.5 |
W3 | 31.7 | 28.5 | 31.8 |
W4 | 29.6 | 29.9 | 29.6 |
M ( | | | |
W1 | 4.5 (3.3) | 4.8 (3.4) | 4.7 (3.4) |
W2 | 3.8 (3.4) | 4.3 (3.3) | 4.0 (3.4) |
W3 | 3.6 (2.7) | 4.0 (2.7) | 3.7 (2.7) |
W4 | 3.2 (2.7) | 3.6 (2.7) | 3.6 (2.7) |

SABM results for Inclusive sample and Stringent sample: Effects on friendship dynamics

PARAMETER | Stringent Sample (Standard SABM): Est. | SE | p-value | Inclusive Sample (Multi-group SABM): Est. | SE | p-value | Difference |
---|---|---|---|---|---|---|---|
Rate period 1 | 26.55 | 2.44 | 0.000 | 17.30 | 1.82 | 0.000 | |
Rate period 2 | 10.96 | 0.75 | 0.000 | 11.35 | 0.71 | 0.000 | |
Rate period 3 | 25.77 | 4.38 | 0.000 | 24.35 | 2.72 | 0.000 | |
outdegree | | 0.12 | 0.000 | | 0.19 | 0.000 | |
reciprocity | | 0.09 | 0.000 | | 0.09 | 0.000 | |
transitive triplets | | 0.05 | 0.000 | | 0.05 | 0.000 | |
3-cycles | | 0.08 | 0.000 | | 0.07 | 0.000 | |
indegree popularity (sqrt) | 0.00 | 0.04 | 0.982 | 0.02 | 0.04 | 0.669 | |
indegree activity (sqrt) | | 0.09 | 0.000 | | 0.08 | 0.000 | |
outdegree activity (sqrt) | | 0.03 | 0.000 | | 0.03 | 0.000 | |
same race | | 0.06 | 0.000 | | 0.05 | 0.000 | |
male alter | | 0.04 | 0.001 | | 0.04 | 0.000 | |
male ego | | 0.05 | 0.018 | | 0.05 | 0.000 | |
same male | | 0.04 | 0.000 | | 0.04 | 0.000 | |
same grade | | 0.06 | 0.000 | | 0.05 | 0.000 | |
parent education alter | −0.02 | 0.04 | 0.539 | 0.01 | 0.04 | 0.889 | |
parent education ego | −0.03 | 0.04 | 0.522 | −0.04 | 0.04 | 0.416 | |
same parent education | 0.02 | 0.04 | 0.608 | 0.06 | 0.04 | 0.119 | |
alcohol use alter | −0.07 | 0.08 | 0.374 | | 0.31 | 0.010 | * |
alcohol use ego | | 0.09 | 0.017 | | 0.30 | 0.008 | |
same alcohol use | N.S. | | | | 0.36 | 0.001 | * |

Note. Est. = Unstandardized parameter estimate. N.S. = Not statistically significant. These effects were not included in the final model because they were found to be nonsignificant during the forward selection model specification. Effects in bold are significant at the p<.05 level.

SABM results for Inclusive sample and Stringent sample: Effects on alcohol use

PARAMETER | Stringent Sample (Standard SABM): Est. | SE | p-value | Inclusive Sample (Multi-group SABM): Est. | SE | p-value |
---|---|---|---|---|---|---|
Rate period 1 | 1.04 | 0.33 | 0.002 | 0.80 | 0.19 | 0.000 |
Rate period 2 | 0.86 | 0.18 | 0.000 | 0.65 | 0.12 | 0.000 |
Rate period 3 | 0.66 | 0.16 | 0.000 | 0.47 | 0.15 | 0.001 |
linear shape | −0.17 | 0.48 | 0.725 | 0.22 | 0.98 | 0.825 |
male | N.S. | | | N.S. | | |
grade | N.S. | | | N.S. | | |
parent education | N.S. | | | N.S. | | |
race/ethnicity | | | | | | |
Black | 0.74 | 0.55 | 0.180 | N.S. | | |
Hispanic | N.S. | | | −1.84 | 1.89 | 0.330 |
White | N.S. | | | N.S. | | |
average similarity | 2.17 | 1.54 | 0.159 | 3.46 | 4.17 | 0.406 |
total similarity | N.S. | | | N.S. | | |
indegree | N.S. | | | N.S. | | |

Note. Est. = Unstandardized parameter estimate. N.S. = Not statistically significant. These effects were not included in the final model because they were found to be nonsignificant during the forward selection model specification. No differences in significant model effects were found.

Missing data in longitudinal social network studies are common and problematic

We review approaches to longitudinal network analysis in RSiena with missing data

Restricting the analytic sample to actors with complete cases is common practice

An alternative approach can result in analytic samples that are more representative

Differences in the results of RSiena models using these approaches are documented