Causal effects in epidemiology are almost invariably studied by considering disease incidence, even when prevalence data are used to estimate the causal effect. For example, if certain conditions are met, a prevalence odds ratio can provide a valid estimate of an incidence rate ratio. Our purpose, and main result, is to provide conditions under which causal effects on prevalence can be estimated in cross-sectional studies, even when the prevalence odds ratio does not estimate incidence.

Using a general causal effect definition in a multivariate counterfactual framework, we define causal contrasts that compare prevalences among survivors from a target population had all been exposed at baseline with that prevalence had all been unexposed. Although prevalence is a measure reflecting a moment in time, we consider the time sequence to study causal effects.

Effects defined using a contrast of counterfactual prevalences can be estimated in an experiment and, with conditions provided, in cross-sectional studies. Proper interpretation of the effect includes recognition that the target is the baseline population, defined at the age or time of exposure.

Prevalences are widely reported, readily available measures for assessing disabilities and disease burden. Effects on prevalence are estimable in cross-sectional studies but only if appropriate conditions hold.

A now-common way to define causal effects in epidemiology uses a counterfactual framework [

This general approach shows that causal effects can be defined using contrasts of prevalences [

Although some have defined [

We assume exposure (E) is dichotomous and occurs at an early age a_{0}, if at all. Disease (D) can occur at any age; it can resolve, in which case we say D is not present, and it can recur in people in whom it had resolved.

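The outcome vector [D_{i,a}, S_{i,a}] encodes disease status and survival: disease component D_{i,a} is 1 if individual i is alive with disease present at age a, and 0 otherwise; survival component S_{i,a} is 1 if individual i is alive at age a, and 0 otherwise. The exposure indicator E_{i} is 1 if individual i was exposed at age a_{0}, and 0 otherwise. We write [D_{i,a}(e), S_{i,a}(e)] for the value of [D_{i,a}, S_{i,a}] if E_{i} had been set to e at age a_{0}, for e = 0, 1.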

Because an individual cannot have been both exposed and unexposed at age a_{0}, at least one of [D_{i,a}(1), S_{i,a}(1)] or [D_{i,a}(0), S_{i,a}(0)] is counterfactual.

Clear effect definitions require several components [

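Assuming these components are specified, the effect of E on presence of D at age a in individual i is D_{i,a}(1) − D_{i,a}(0), which contrasts disease presence had exposure been set to 1 at age a_{0} with disease presence had it been set to 0 at age a_{0}.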

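To define a population average effect of E on disease presence at age a, we specify the target population (P_{0}) at age a_{0} when exposure is set. Then, we define the causal prevalence difference (cPD) for P_{0} as the prevalence at age a in P_{0} if all had been exposed at age a_{0}, compared with that prevalence if all had been unexposed. In equation form,

$$\mathrm{cPD} = \frac{\sum_{i \in P_0} D_{i,a}(1)}{\sum_{i \in P_0} S_{i,a}(1)} - \frac{\sum_{i \in P_0} D_{i,a}(0)}{\sum_{i \in P_0} S_{i,a}(0)} \qquad (1)$$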

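The vectors [∑_{i∈P0} D_{i,a}(1), ∑_{i∈P0} S_{i,a}(1)] and [∑_{i∈P0} D_{i,a}(0), ∑_{i∈P0} S_{i,a}(0)] that enter this contrast are summed over the target population P_{0}, not the subpopulation that survives to age a.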
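As a concrete reading of this definition, the sketch below computes a causal prevalence difference from hypothetical counterfactual outcome indicators. The function name and the toy data are illustrative assumptions, not taken from the study, and the code assumes the disease indicator is 0 for anyone not alive at the age of measurement. Both sets of counterfactual outcomes are never jointly observed in practice; the example only makes the arithmetic of the definition explicit.

```python
from fractions import Fraction

def causal_prevalence_difference(d1, s1, d0, s0):
    """Causal prevalence difference for a target population.

    d1[i], s1[i]: disease presence and survival of person i at the age of
    measurement had everyone been exposed at baseline.
    d0[i], s0[i]: the same indicators had everyone been unexposed.
    Assumes a disease indicator is 0 whenever the survival indicator is 0.
    """
    if not (len(d1) == len(s1) == len(d0) == len(s0)):
        raise ValueError("all vectors must cover the same target population")
    for d, s in list(zip(d1, s1)) + list(zip(d0, s0)):
        if d == 1 and s == 0:
            raise ValueError("disease indicator must be 0 for non-survivors")
    prevalence_if_exposed = Fraction(sum(d1), sum(s1))
    prevalence_if_unexposed = Fraction(sum(d0), sum(s0))
    return prevalence_if_exposed - prevalence_if_unexposed

# Hypothetical target population of 8 people (illustrative values only).
d1 = [1, 1, 1, 0, 0, 0, 0, 0]
s1 = [1, 1, 1, 1, 1, 1, 0, 0]  # 6 would survive if all were exposed
d0 = [1, 0, 0, 0, 0, 0, 0, 0]
s0 = [1, 1, 1, 1, 1, 1, 1, 1]  # 8 would survive if all were unexposed

cpd = causal_prevalence_difference(d1, s1, d0, s0)
print(cpd)  # 3/6 - 1/8 = 3/8
```

Both sums run over all 8 members of the target population, yet each denominator counts only those who would survive under that exposure level.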

Of note, Flanders and Klein [_{f}, the full population at baseline. Now the target population (P_{0}) coincides with P_{f}, provided the survey population P_{1} consists of all survivors from P_{f}. Thus, apart from changing from ratios to differences, the previous [

A natural estimator of the population average causal effect of E on presence of D in population P_{0} is the observed difference:

$$\mathrm{PD} = p_{1,a_1} - p_{0,a_1} \qquad (2)$$

where p_{j,a1} = ∑_{i∈P0:Ei=j} D_{i,a1}/n_{j,a1} = ∑_{i∈P0:Ei=j} D_{i,a1}/∑_{i∈P0:Ei=j} S_{i,a1} is the observed prevalence of D at age a_{1} in exposure group j, and n_{j,a1} is the observed number alive with exposure equal to j at age a_{1}.
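The observed difference can be computed directly from individual records. The following is a minimal sketch, assuming each record carries an exposure indicator, a survival indicator, and a disease indicator at the age of measurement; the `Record` type, the function names, and the data are hypothetical and serve only to illustrate the calculation.

```python
from typing import List, NamedTuple

class Record(NamedTuple):
    exposed: int   # exposure at baseline (1 = exposed, 0 = unexposed)
    alive: int     # alive at the age of measurement
    diseased: int  # disease present at the age of measurement (0 if not alive)

def observed_prevalence(records: List[Record], group: int) -> float:
    """Observed prevalence among those alive in one exposure group."""
    cases = sum(r.diseased for r in records if r.exposed == group)
    alive = sum(r.alive for r in records if r.exposed == group)
    if alive == 0:
        raise ValueError("no one alive in exposure group %d" % group)
    return cases / alive

def observed_prevalence_difference(records: List[Record]) -> float:
    """Exposed-group prevalence minus unexposed-group prevalence."""
    return observed_prevalence(records, 1) - observed_prevalence(records, 0)

# Hypothetical records: 5 exposed (1 died) and 5 unexposed (none died).
records = (
    [Record(1, 1, 1)] * 2 + [Record(1, 1, 0)] * 2 + [Record(1, 0, 0)]
    + [Record(0, 1, 1)] + [Record(0, 1, 0)] * 4
)

pd_hat = observed_prevalence_difference(records)
print(round(pd_hat, 3))  # 2/4 - 1/5
```

Only directly observable quantities enter the calculation, and people who died contribute to neither numerator nor denominator.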

In results, we present assumptions that suffice for this estimator to be unbiased in cross-sectional studies; in

Of note, this estimator is simple and involves directly observable variables. As in the definition, deaths including those from “competing risks” are treated realistically, as part of the causal process affecting disease prevalence.

Our goal and main novel result is to state assumptions sufficient for expression [

The effects of exposure on disease presence at a specified time after exposure can be estimated in an experiment. Briefly, one identifies and enrolls subjects, say at age a_{0}. For simplicity, we focus throughout on a specific age at exposure (a_{0}), although one could include different age groups and calculate a summary measure or model age patterns. We may optionally measure baseline presence of disease (age a_{0}) and then expose a random subgroup to E or placebo. We follow the cohort to age a_{1} and measure disease presence, assuming no dropouts or loss to follow-up.

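Sufficient conditions under which estimator 2 validly estimates causal effects in randomized experiments are exchangeability (the counterfactual outcomes D_{i,a}(e) and S_{i,a}(e) are independent of actual exposure E_{i}), consistency (D_{i,a} = D_{i,a}(E_{i}) and S_{i,a} = S_{i,a}(E_{i})), and the stable unit treatment value assumption (SUTVA).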

Exchangeability need hold only conditional on measured baseline covariates (C;

We illustrate estimation of effects on prevalence in an experiment. Kuller et al. [

From their results, the prevalence difference is 27% − 16% = 11%. This contrast is not only descriptive; under our assumptions, which should often be plausible in a randomized experiment (previously mentioned), it can also be interpreted as estimating the causal effect of the intervention on prevalence of optimal LDL cholesterol. Because prevalence reflects both incidence and duration after onset [

The assumptions that assure unbiasedness of estimator 2 are less straightforward in a cross-sectional survey. We assume that survey participants are randomly selected from a well-defined population of living people P_{1} at age a_{1} (no surrogates for the deceased). The presence (or absence) of disease at age a_{1} (D_{i,a1}) and prior exposure at age a_{0} < a_{1} are accurately assessed.

Perhaps the main challenge is defining the target population needed to clearly define causal effects and for which a survey of population P_{1} is expected to yield valid estimates of causal prevalence differences. Because effects require time to occur, the target must have been enumerable at age a_{0} before measuring the outcome. We can clearly specify the target if population P_{1} consists of all survivors from a larger population P_{0} that was alive at the time of potential exposure, age a_{0}. Population P_{0} should be definable by observable, contemporaneous factors. If we can specify P_{0}, then measurement of disease prevalence in the survey provides just the information needed to estimate the cPD, as if a cohort study of P_{0} had been done. In particular, summation over P_{0} appearing in estimator 2 can be replaced by summation over P_{1} because summands D_{i,a1} and S_{i,a1} are 0 for individuals who die between age a_{0} and a_{1}. If the survey involves a 100% sample, prevalences in the sample coincide with those from a cohort study of P_{0}, and if less than 100%, prevalences are valid estimators of them because we assume exposure-specific prevalences in survey participants represent those in P_{1} (
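The replacement of sums over the baseline population by sums over survivors can be checked on a small example. The sketch below uses hypothetical (exposed, alive, diseased) triples and an illustrative helper name; it assumes, as in the text, that the disease indicator is 0 for anyone who died before the age of measurement.

```python
def prevalence_difference(rows):
    """Prevalence difference from a list of (exposed, alive, diseased) triples."""
    prevalence = {}
    for group in (0, 1):
        cases = sum(d for e, s, d in rows if e == group)
        alive = sum(s for e, s, d in rows if e == group)
        prevalence[group] = cases / alive
    return prevalence[1] - prevalence[0]

# Hypothetical baseline population: 4 exposed (1 died), 5 unexposed (1 died).
baseline_population = [
    (1, 1, 1), (1, 1, 0), (1, 1, 0), (1, 0, 0),
    (0, 1, 1), (0, 1, 0), (0, 1, 0), (0, 1, 0), (0, 0, 0),
]

# A complete survey at the later age sees only the survivors.
survey_population = [row for row in baseline_population if row[1] == 1]

from_cohort = prevalence_difference(baseline_population)
from_survey = prevalence_difference(survey_population)
print(from_cohort == from_survey)  # deaths add 0 to every sum
```

Because the two people who died contribute 0 to both the case count and the survivor count, the cohort calculation and the survivor-only calculation use identical sums.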

The other assumptions needed for unbiased estimation coincide with those for experiments and cohort studies. Specifically, we need exchangeability for target population P_{0} (D_{i,a1}(e) and S_{i,a1}(e) independent of E_{i}), consistency in P_{0}, and SUTVA. These assumptions mean that comparison of disease status among exposed and unexposed survivors informs what would have happened if exposure had been randomized in population P_{0} at age a_{0} and the cohort followed to age a_{1}.
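A small simulation can illustrate how these assumptions work together. The sketch below is not the authors' analysis: it assumes hypothetical disease and death probabilities, generates both counterfactual outcomes for each member of a baseline population, assigns exposure at random (so exchangeability holds by design), and then surveys a random half of the survivors. All names, probabilities, and the seed are illustrative assumptions.

```python
import random

def draw_outcome(rng, exposed):
    """Draw (disease, survival) at the later age under one exposure level.

    All probabilities are hypothetical values chosen for illustration.
    """
    develops_disease = rng.random() < (0.30 if exposed else 0.15)
    if develops_disease:
        dies = rng.random() < (0.20 if exposed else 0.10)
    else:
        dies = rng.random() < 0.05
    s = 0 if dies else 1
    d = 1 if (develops_disease and s == 1) else 0  # the dead are not prevalent cases
    return d, s

def simulate(n=100_000, sample_fraction=0.5, seed=2024):
    rng = random.Random(seed)
    # potential[i][e] holds the counterfactual (D, S) of person i under exposure e.
    potential = [(draw_outcome(rng, 0), draw_outcome(rng, 1)) for _ in range(n)]

    def counterfactual_prevalence(e):
        return sum(p[e][0] for p in potential) / sum(p[e][1] for p in potential)

    true_cpd = counterfactual_prevalence(1) - counterfactual_prevalence(0)

    cases, alive = [0, 0], [0, 0]
    for p in potential:
        e = rng.randrange(2)  # randomized exposure: exchangeability by design
        d, s = p[e]           # consistency: observed outcome = counterfactual under e
        # Survey: a random sample of survivors, independent of exposure and disease.
        if s == 1 and rng.random() < sample_fraction:
            cases[e] += d
            alive[e] += 1
    estimate = cases[1] / alive[1] - cases[0] / alive[0]
    return true_cpd, estimate

true_cpd, estimate = simulate()
print(round(true_cpd, 3), round(estimate, 3))
```

Under this setup the survey-based prevalence difference should land close to the causal prevalence difference of the baseline population, even though only survivors are observed and exposure affects survival.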

Some examples may help illustrate conceptualization of population P_{0}. Suppose the survey samples residents of a region at age a_{1}, and that we focus on exposure at age a_{0}. We exclude recent immigrants from P_{1} because they were not in the resident population at younger ages. Population P_{0} is then the population of residents at the younger age a_{0}, a_{1} − a_{0} years earlier. P_{1} should be (or represent) all in P_{0} who survive to a_{1}. Emigration from P_{0} is permissible if independent of disease, survival, and exposure.

We illustrate exchangeability for target population P_{0} and representativeness of the surveyed population (P_{1}) using a directed acyclic graph (DAG). Rules for constructing and interpreting DAGs are reviewed in detail elsewhere. In the first DAG, E is independent of other causes (U_{0}) of disease and survival (S_{1}). Membership in P_{1} depends on survival, not emigrating, and other factors, but not directly on E. Participation depends on P_{1} and other factors U_{1}, but not directly on E. Under these causal relationships, we expect exchangeability for target population P_{0} and representativeness of survey population P_{1}. A second DAG modifies the first by having S_{1} and D_{1} affect membership in P_{1}. We now expect bias, as survey population P_{1} may not represent all survivors from target population P_{0}.

If the exposure-specific prevalences in survey participants do not represent those in P_{1} (i.e., all P_{0}-survivors), estimator [_{0} in

Example 2 illustrates the use of prevalence contrasts for estimating effects in a survey (cross-sectional study). Our goal is to estimate the effect of starting smoking at age 18 years versus never starting or starting later on the prevalence of poor or fair health 20 years later at age 38 years. Self-rated health status is of interest partly because it consistently and strongly predicts subsequent mortality, even after control for multiple other health-status indicators [

The baseline population P_{0} consisted of U.S. residents who were 18–21 years old about 20 years before interview. If the exposed and unexposed were exchangeable conditional on controlled covariates, and participants were representative of all U.S. residents aged 38 ± 2 years during this time period, the prevalence odds ratio should consistently estimate the effect of taking up regular smoking at about age 18 years on having self-reported poor health 20 years later. Some U.S. residents died between ages 18 and 38 years, possibly because of smoking. Because our interest is in the effect of smoking on prevalence, these deaths do not represent bias because of competing risks but rather are part of the defined effect of smoking [

When epidemiologists consider disease causation they almost invariably consider it in terms of disease onset (i.e., incidence). Rothman et al. developed causal concepts as follows ([

A cohort study is often viewed as a natural design to estimate disease incidence, and a cross-sectional study as a natural design to estimate prevalence as prevalence is a measure reflecting a moment in time. However, to study causal effects, the time sequence must be considered. So, to study causal effects on prevalence, a cohort study would be a natural design. Our assumptions provide conditions wherein observations from a cross-sectional study provide information that adequately approximates information from a cohort study for estimating effects on prevalence.

We have defined causal contrasts that compare the prevalence among survivors from the target population had all in the target been exposed at baseline with that prevalence had they been unexposed. The definition requires specification of the target population, exposure, ages, and other factors [

A key assumption is that the baseline target population be clearly defined and potentially enumerable. Although explicitly defined in experiments and cohort studies, in cross-sectional studies this baseline population may require conceptualization as an earlier “parent” population, defined so that the population surveyed would consist of (or represent) all survivors from the parent population. Identification of a parent population with the needed characteristics creates a situation wherein observations from a cross-sectional study can reproduce those that would have been obtained if a cohort study of that population had been done. Many surveys will not permit clear delineation of the parent population; if not, associated causal-effect definitions may be unsatisfactory. Other assumptions for validity of the observed PD as an estimator of the causal PD in cross-sectional studies are equally important. In particular, exchangeability must be evaluated and, as in cohort studies, can be suspect. It is not expected to hold if confounding is present, perhaps due to causes of disease that are also associated with exposure. If the target is not disease free at baseline, exchangeability can also be suspect if the prevalence at baseline differs between the exposed and unexposed subgroups. In cross-sectional studies involving prevalence contrasts, exchangeability can be threatened by factors that affect either disease duration or risk and are associated with exposure. As with other observational studies, some assumptions are unverifiable, and sensitivity analyses may be useful.

The ages at exposure and at disease measurement must be clearly specified. These ages are critical for several reasons. First, age is a potential confounder, for example, if it affected prevalence and was associated with exposure. Second, age at exposure could be an effect measure modifier. Third, the age and time intervals between exposure (or nonexposure) and outcome measurement can also affect prevalence.

Ideally, exposure would have occurred, if at all, at age a_{0}, the age of the target population at baseline, analogous to recommendations that follow-up in cohort studies begins at or before exposure. For example, if exposure at age 50 is studied in a survey of 60-year-olds, P_{0} should consist of people who were 50 years old 10 years before the survey and should be defined so that P_{1} consists of all survivors from P_{0}. P_{0} may be easier to define if the survey is population-based (e.g., all 60-year-old U.S. residents in 2010, excluding recent immigrants), so that it might be defined as the corresponding population 10 years earlier (e.g., all 50-year-old U.S. residents in 2000). See

A comparison of prevalences may seem wanting as a causal measure because prevalence is affected by disease incidence and duration (see

W.D.F. is a consultant at Biogen Idec. The remaining authors have no conflicts of interest to disclose.

This publication was supported by US EPA grant R834799. This publication’s contents are solely the responsibility of the grantee and do not necessarily represent the official views of the US EPA or CDC. Furthermore, US EPA does not endorse the purchase of any commercial products or services mentioned in the publication.

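We argue that the estimator (expression 2) is unbiased under our assumptions for experiments. By completeness of follow-up, n_{j,a1} = ∑_{i∈P0:Ei=j} S_{i,a1} for j = 0, 1, where S_{i,a1} = 1 if subject i is alive at age a_{1} and 0 otherwise. By counterfactual model consistency for both D_{i,a1}(e) and S_{i,a1}(e), ∑_{i∈P0:Ei=j} D_{i,a1}/∑_{i∈P0:Ei=j} S_{i,a1} = ∑_{i∈P0:Ei=j} D_{i,a1}(j)/∑_{i∈P0:Ei=j} S_{i,a1}(j). By exchangeability, the exposure-group sums ∑_{i∈P0:Ei=j} D_{i,a1}(j) and ∑_{i∈P0:Ei=j} S_{i,a1}(j), each divided by the number in P_{0} with E_{i} = j, estimate the corresponding averages of D_{i,a1}(j) and S_{i,a1}(j) over all of P_{0}, so their ratio estimates ∑_{i∈P0} D_{i,a1}(j)/∑_{i∈P0} S_{i,a1}(j), the counterfactual prevalence that appears in expression 1.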

To estimate the cPD in a cohort study, we identify a baseline population P_{0}, some of whom were exposed at baseline (age a_{0}), others not. The assumptions needed for unbiasedness of estimator 2 are the same as those for an experiment. However, exchangeability, which should hold in an experiment with good randomization, needs to be critically evaluated. In particular, we must verify that collider bias is not present. We follow the cohort to age a_{1} and assess disease presence.

Our claim, that estimator [

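Finally, we argue that estimator 2 is consistent under our assumptions for cross-sectional studies. By assumption, the sample is representative of population P_{1}, so exposure-specific prevalences in the sample consistently estimate those in P_{1}. Also, by assumption, a larger, enumerable population P_{0} exists such that P_{1} consists of all surviving members of P_{0}. Because D_{i,a1} and S_{i,a1} are 0 for members of P_{0} who do not survive to age a_{1}, sums over P_{1} equal the corresponding sums over P_{0}:

$$\frac{\sum_{i \in P_1: E_i = j} D_{i,a_1}}{\sum_{i \in P_1: E_i = j} S_{i,a_1}} = \frac{\sum_{i \in P_0: E_i = j} D_{i,a_1}}{\sum_{i \in P_0: E_i = j} S_{i,a_1}}$$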

This last expression is the same as estimator 2 for the target cohort P_{0} if it was the baseline population in a cohort study followed from age a_{0} to age a_{1}. But estimator 2 is unbiased by our assumptions and arguments mentioned previously for the cohort P_{0}.

At times, an alternative estimator that accounts for baseline prevalence may be unbiased even if estimator 2 is biased.

For example, the prevalence at baseline may differ between exposed and unexposed because of factors associated with exposure but unassociated with changes in disease thereafter. Measurement error can affect the decision to further account or adjust for baseline disease status as discussed by Glymour et al. [

Here we consider an approach to defining prevalence effects that provides more details, still rooted in the general causal-effect definition which contrasts parameters of the multivariate counterfactual-outcome distributions presented by Flanders and Klein [_{i,a} defined in the main text can be viewed as an infinite-dimensional vector, _{i}^{th} component of _{i}_{i,a} encodes disease presence, as defined in the main text, for each age ^{th} components of _{i}_{i}_{i}_{i,a}(_{i,a}(_{i,a} respectively. Using _{i}_{i,k}_{0}) to the ^{th} disease episode of subject _{i}_{i} were set to _{i, 1}(_{i,k}^{th} episode of subject _{i} were _{i, 1}(

This formulation includes details about disease onset and duration for potentially multiple disease episodes over time. With it, causal effects of exposure on prevalence can be traced over time since baseline. Additionally, this information allows consideration of the causal effect of exposure on disease onset, survival, cumulative disease duration, average duration per episode, proportion of time with the disease or condition and the number of episodes.

One of the examples used by Flanders and Klein to illustrate their general, multivariate definition of causal effects was the causal conditional risk difference (cCRD) [_{0}, conditional on survival to age _{0} by:
_{i,a}(e) is 1 if the outcome of interest occurs between age _{i,a}(e) replaces D_{i,a}(e) and is defined differently. The definition of cPD is similar, but involves presence of disease at age _{i,a}(e)), rather than occurrence of disease in the risk period starting at age _{i,a}(e)).

Since the denominator of Estimator 2 equals the number in the baseline population (target P_{0}) who survive to age _{1}, one could, and a reviewer did, ask if the estimator in _{1}. A theoretical justification that the estimator in _{0} and SUTVA to show that (∑_{i∈P0:Ei=e}
_{i,a}_{p0,e}, ∑_{i∈P0:Ei=e}_{i,a}_{p0,e}) is an unbiased estimator of the population-average, multivariate effect of E on outcome vector (D_{i}, S_{i}), where _{p0,e} is the number in P_{0} with E = e. Slutsky’s theorem then shows consistency for the ratio contrasts–the prevalence difference.) Collider bias for one target (e.g., P_{1}) but not another (e.g. P_{0}), is also discussed elsewhere [

First, we emphasize that the target population is selected at baseline (time 0), and the exposure is independent of risk factors for disease and survival in this population (exchangeability assumption, perhaps conditional on common causes). Thus, selection (collider bias) from selection of the target is not an issue, by assumption. Furthermore, the final estimator (_{i∈P0:Ei=e}
_{i,a}_{p0,e}, ∑_{i∈P0:Ei=e}_{i,a}_{p0,e}). But this multivariate estimator involves no exclusions, stratification, or control (except for stratification by exposure which is exchangeable), and so involves no collider bias.

Second, we note reassuringly that the cPD (_{0}. The causal prevalence difference addresses the question: “What is the population average effect of exposure on the target population, as measured by disease prevalences at age a_{1}?” Of course, other questions can and typically should be asked, such as, “What is the population average effect of exposure on the target population, as measured by survival at age a_{1}?” or “What is the population average effect of exposure on the target population, as measured by disease incidence through age a_{1}?” Although other questions exist, by randomizing exposure at baseline, following the exposed and unexposed groups to age a_{1}, and then accurately measuring disease presence and contrasting the prevalences, one can consistently estimate the effect of exposure on prevalence (via estimator 2). Thus, the defined effect (expression 1) is directly observable using simple, well-defined procedures that end by using estimator 2 in a randomized experiment.

Third, we note that effects of exposure, if any, on death before age a_{1} are part of the defined effect on prevalence (expression 1) and are appropriately reflected in defining the effect and calculating the measure that estimates it. For example, one way in which exposure could cause a reduction in disease prevalence at age a_{1} would be to differentially reduce survival among those who had developed disease. Such a prevalence reduction would be an expected part of an effect on prevalence and correctly estimated, in a randomized experiment or other study under our assumptions. Through use of multivariate outcomes and effects, as described above, a more complete characterization of the exposure’s impacts can be obtained.
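The scenario in which exposure lowers prevalence only by shortening survival among the diseased can be made concrete with a few counts. The numbers below are hypothetical and chosen for easy arithmetic: exposure leaves disease onset unchanged but halves the number of diseased people who survive to the later age. The helper names are illustrative assumptions.

```python
from fractions import Fraction

BASELINE_SIZE = 1000  # hypothetical size of the baseline target population

def prevalence(diseased_alive, healthy_alive):
    """Prevalence among survivors at the later age."""
    return Fraction(diseased_alive, diseased_alive + healthy_alive)

def survival(diseased_alive, healthy_alive):
    """Proportion of the baseline population alive at the later age."""
    return Fraction(diseased_alive + healthy_alive, BASELINE_SIZE)

# Counterfactual counts at the later age if all were unexposed:
# 200 alive with disease, 700 alive without disease, 100 dead.
unexposed = {"diseased_alive": 200, "healthy_alive": 700}
# If all were exposed: the same people develop disease, but exposure
# halves survival among the diseased (100 alive with disease, 200 dead).
exposed = {"diseased_alive": 100, "healthy_alive": 700}

effect_on_prevalence = prevalence(**exposed) - prevalence(**unexposed)
effect_on_survival = survival(**exposed) - survival(**unexposed)

print(effect_on_prevalence)  # 1/8 - 2/9 = -7/72
print(effect_on_survival)    # 4/5 - 9/10 = -1/10
```

Reporting the effect on survival alongside the effect on prevalence shows that the lower prevalence here comes from deaths among the diseased, which is the kind of fuller picture a multivariate outcome provides.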

Supplemental example 1 illustrates use of prevalence contrasts for estimating effects in a cohort study. To estimate effects of early-life factors on sedentary lifestyle in adolescents, Hallal et al. [

To further illustrate issues that can arise in defining exposure contrasts and the target P_{0}, consider the effects of starting alcohol use at a young age, say age 15, on prevalence of hepatic disease at age 35. We might define exposure as having started regular heavy alcohol use by age 15, and for comparison an “unexposed” group as those who had not started regular, heavy drinking by age 15. The resulting contrast, just as it would be in a cohort study, actually compares the effect of starting alcohol use early (age 15) versus later or never. The presence of people who later became regular heavy drinkers would, just as in a cohort study, reduce the expected effects of heavy drinking compared to a completely unexposed population of, say, never drinkers. But even in a cohort study, accounting for changes in exposure might require G-computation or a related method. P_{0} should be defined, if possible, as those who were 15 years old about 20 years before the survey, in a way that the survey population represents all survivors from P_{0}.

The figure summarizes causal relationships using a DAG for the baseline population P_{0}. If correct, estimator 2 should be unbiased (see text). D_{1} represents disease presence at age a_{1}; U_{0}, other causes of survival (S_{1}) and disease (D_{1}). P_{1} is the survey population; membership depends deterministically on survival and on not emigrating or other loss, but not directly on exposure E or D_{1}. Participation depends on P_{1} and other factors U_{1}. †Emigration and other factors affect being in population P_{1}.

The figure illustrates a situation such as that in the preceding figure, except that S_{1} and D_{1} affect membership in P_{1} (see text). D_{1} represents disease presence at age a_{1}; U_{0}, other causes of survival (S_{1}) and disease. P_{1} is the survey population; membership depends on survival, not emigrating, and D_{1}. Participation depends on P_{1} and other factors U_{1}. †Emigration and other factors affect being in population P_{1}.

The figure illustrates a situation such as that in the first figure, except for a common cause (U_{0}) of prevalent disease D_{1} and emigration. Exposure-specific prevalences in the survey population (P_{1}) would be expected to differ from those in all survivors, and bias is expected. U_{0}, other causes of survival (S_{1}) and disease. P_{1} is the survey population; membership depends deterministically on survival, not emigrating or other loss, and D_{1}. Participation depends on P_{1} and other factors U_{1}. †Emigration, loss, and other factors affect being in population P_{1}. ††U_{0} is a common cause of prevalent disease and emigration, so prevalence in P_{1} is expected to differ from that in all survivors.

Definitions and notation

Term | Brief definition |
---|---|
[D_{i,a}(e), S_{i,a}(e)] | Counterfactual-outcome vector; components are the counterfactual outcomes D_{i,a}(e) and S_{i,a}(e), the values of D_{i,a} and S_{i,a} had E_{i} been set to e |
D_{i,a}(e) | First component of counterfactual-outcome vector [D_{i,a}(e), S_{i,a}(e)] |
S_{i,a}(e) | Second component of counterfactual-outcome vector [D_{i,a}(e), S_{i,a}(e)] |
D_{i,a}(1) − D_{i,a}(0) | Individual causal effect on disease presence at age a |
cPD | Population-average effect of exposure at age a_{0} on disease prevalence at age a |
Exchangeability (disease) | The counterfactual outcome with E set to e, D_{i,a}(e), is independent of actual exposure E_{i} |
Exchangeability (survival) | The counterfactual outcome with E set to e, S_{i,a}(e), is independent of actual exposure E_{i} |
Consistency | The observed outcome equals the counterfactual outcome if exposure were set to the actual exposure: D_{i,a} = D_{i,a}(E_{i}) |
Stable unit treatment value assumption | The outcome of individual i is independent of the exposure status of all other individuals: D_{i,a} does not depend on E_{j} for j ≠ i |

Examples of surveys for which the “parent” population P_{0} may be specifiable

Survey | Population P_{1} | Parent population P_{0} |
---|---|---|
Population-based national survey | U.S. residents, noninstitutionalized, age a_{1}, excluding those who immigrated between ages a_{0} and a_{1} | U.S. residents, noninstitutionalized, age a_{0} (a_{0} < a_{1}) |
Population-based statewide telephone | State residents, noninstitutionalized, age a_{1}, excluding those who immigrated between ages a_{0} and a_{1} | State residents, noninstitutionalized, age a_{0} (a_{0} < a_{1}) |
Population-based national telephone | U.S. residents, noninstitutionalized, age a_{1}, excluding those who immigrated between ages a_{0} and a_{1} | U.S. residents, noninstitutionalized, age a_{0} (a_{0} < a_{1}) |

BRFSS = Behavioral Risk Factor Surveillance System; NHANES = National Health and Nutrition Examination Survey; NHIS = National Health Interview Survey.

P_{1}, P_{0}: as in the main text, P_{1} is the population (age a_{1}) sampled for the survey, and P_{0} is the parent population (age a_{0}) defined so that P_{1} consists of all surviving members of P_{0}. To assure temporal precedence, a_{0} is less than a_{1}.