
The collection of papers in this journal supplement provides insight into the association of various covariates with concentrations of biochemical indicators of diet and nutrition (biomarkers), beyond age, race and sex using linear regression. We studied 10 specific sociodemographic and lifestyle covariates in combination with 29 biomarkers from NHANES 2003–2006 for persons ≥20 y. The covariates were organized into 2 chunks, sociodemographic (age, sex, race-ethnicity, education, and income) and lifestyle (dietary supplement use, smoking, alcohol consumption, BMI, and physical activity) and fit in hierarchical fashion using each chunk or set of related variables to determine how covariates, jointly, are related to biomarker concentrations. In contrast to many regression modeling applications, all variables were retained in a full regression model regardless of statistical significance to preserve the interpretation of the statistical properties of

A vast amount of data is collected on each sampled person in the continuous NHANES survey, providing a unique opportunity to assess and describe the nutritional status of the US population. However, NHANES cannot assess cause and effect. The variables collected in observational studies, such as NHANES, have not been experimentally manipulated and/or randomly assigned. In this setting, causal pathways are obscured and observed differences may provide little insight into cause and effect. Observational studies, however, can still provide an approximate description of patterns in the data and form a basis to estimate associations and perform hypothesis testing after controlling simultaneously for many variables, although estimates may still be biased by residual or unmeasured confounding.

Application of any statistical method first requires a well-formulated problem within the scope of the study design’s ability to provide solutions. Adhering to the tenets of the scientific method should precede any statistical analysis. While the basic assumptions of the statistical method remain important, uncritical application and/or repeated application of a statistical modeling analysis without a well-formulated plan can simply capitalize on the random variation and lead to a model that has little utility for prediction, statistical estimation or testing, and rather leads to false positive findings (

One of the hallmarks of the scientific method is a “feedback loop” between theory and practice as we further refine our hypotheses after accumulating new facts (

The dependent variables in the papers in this journal supplement include biomarkers of diet and nutrition measured in adults ≥20 y who provided a biological specimen during their examination at the mobile examination center in NHANES 2003–2006. Some biomarkers were only available for a subset of the full sample or for only 2 of the 4 survey years, i.e. 1 cycle (

Composite variables are often used in public health messaging and the scientific literature. For fat soluble biomarkers, the following composite variables were created by summing a group of chemically related compounds: carotenes, xanthophylls, saturated, monounsaturated, polyunsaturated, and total fatty acids. These composite variables were only calculated for persons who had non-missing values across all corresponding biomarkers. Therefore, a small number of values were missing for these composite variables relative to the individual biomarkers (see

Ten specific sociodemographic and lifestyle factors were selected as covariates based on the information available in NHANES and on evidence in the literature that these variables may be related to nutritional biomarkers. The sociodemographic variables included age, sex, race-ethnicity, education level, and family poverty income ratio (PIR^{4}). For bivariate analyses, we categorized the sociodemographic variables as follows: age (20–39 y, 40–59 y, and ≥60 y); race-ethnicity (Mexican American [MA], non-Hispanic black [NHB], and non-Hispanic white [NHW]); and education (<high school, high school, and >high school). PIR was calculated by dividing total family income by the poverty guidelines adjusted for family size at year of interview. BMI (kg/m^{2}) was categorized using WHO guidelines for underweight (<18.5), normal (18.5–<25), overweight (25–<30), and obese (≥30) (

The physical activity variable was constructed using files that provide detailed information about specific leisure time physical activities (

Average daily alcohol consumption was derived from the alcohol use questionnaire as: [(quantity × frequency) / 365.25]. Respondents were asked about their alcohol use, where a drink was defined as a 12 oz beer, a 5 oz glass of wine, or 1.5 oz of liquor. This is equivalent to a “standard” drink in the United States, which contains 0.6 US fluid oz (18 mL) of alcohol and corresponds to 14.2 g of ethanol. Persons who reported having fewer than 12 drinks of any type of alcoholic beverage in the past year (or lifetime) were considered nondrinkers. For descriptive purposes, alcohol consumption was categorized into the following groups: no drinks, <1, 1–<2, and ≥2 drinks/d.
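The derivation above can be sketched in a few lines of Python. This is only an illustration: the function and argument names are hypothetical, and it assumes that frequency is expressed as drinking days per year, which the text does not state explicitly.

```python
def average_daily_drinks(drinks_per_drinking_day, drinking_days_per_year,
                         fewer_than_12_drinks=False):
    """Average daily standard drinks: (quantity x frequency) / 365.25.

    Assumption: frequency is given in drinking days per year.
    Respondents flagged as having had fewer than 12 drinks in the past
    year (or lifetime) are treated as nondrinkers, i.e. 0 drinks/d.
    """
    if fewer_than_12_drinks:
        return 0.0
    return (drinks_per_drinking_day * drinking_days_per_year) / 365.25


def alcohol_category(avg_daily_drinks):
    """Descriptive groups: no drinks, <1, 1-<2, and >=2 drinks/d."""
    if avg_daily_drinks == 0:
        return "no drinks"
    if avg_daily_drinks < 1:
        return "<1"
    if avg_daily_drinks < 2:
        return "1-<2"
    return ">=2"


# Hypothetical respondent: 2 drinks on each of 104 drinking days per year.
example = average_daily_drinks(2, 104)
print(round(example, 3), alcohol_category(example))
```

A respondent flagged as a nondrinker is assigned 0 drinks/d regardless of the reported quantity and frequency, which mirrors the rule described in the paragraph above.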

In a few cases, additional variables were added to the full model (sociodemographic and lifestyle factors) to provide important adjustments that might be expected for certain biomarkers. For the urine biomarkers, urine creatinine concentration was included as a covariate to adjust for the dilution of the spot urine. For fat-soluble nutrients, total cholesterol and prescription use of lipid-altering drugs were included because some fat-soluble nutrients are transported in the plasma by lipids and to adjust for drug-related changes in fat absorption and/or lipid metabolism. For 25-hydroxyvitamin D, season and latitude were included as proxies to adjust for sun exposure, which has been shown to have an impact on vitamin D status.

The mathematical form of the continuous covariates (age, PIR, BMI, physical activity, and alcohol consumption) was assumed to be linear in the regression model. A log transformation was applied to BMI, alcohol consumption, and physical activity; although not a necessary assumption, linear regression is more robust when the independent variables have an approximately normal distribution (

The analysis plan entailed computing Spearman correlations to describe bivariate associations between each biomarker and selected continuous variables. Bivariate associations between each biomarker and categorical variables were described with geometric means (or arithmetic means where appropriate) and 95% CI across the categories. The means were compared across the categories on the basis of Wald F tests (tests whether at least 1 of the means across the categories is significantly different from the others). Geometric or arithmetic means were not presented if the minimum sample size of 42 was not reached (assumed average design effect of 1.4 multiplied by 30) (

At the outset, we identified 10 different covariates for 29 different responses (biomarkers) and decided to keep all variables in a full regression model regardless of statistical significance. Covariates were arranged into 2 sets or “chunks” of sociodemographic factors and lifestyle factors. We tested these covariates in a hierarchical, chunk-wise fashion such that each chunk or set of related variables was tested simultaneously (^{2}
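The chunk-wise idea of comparing the variance explained before and after adding a set of related covariates can be illustrated with a small, self-contained Python sketch. It is not the authors' analysis: the data are synthetic, only four hypothetical covariates are used, and the fit is an unweighted least-squares fit that ignores the NHANES sample weights, stratification, and clustering.

```python
import random


def ols_r2(X, y):
    """R^2 of an unweighted least-squares fit of y on X plus an intercept."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data (all coefficients below are made up for illustration).
random.seed(1)
n = 400
age = [random.uniform(20, 80) for _ in range(n)]
female = [float(random.randint(0, 1)) for _ in range(n)]
supplement = [float(random.randint(0, 1)) for _ in range(n)]
smoker = [float(random.randint(0, 1)) for _ in range(n)]
ln_biomarker = [2.0 + 0.004 * a + 0.05 * f + 0.30 * s - 0.15 * k
                + random.gauss(0, 0.3)
                for a, f, s, k in zip(age, female, supplement, smoker)]

# Smaller model: a sociodemographic chunk; larger model: adds a lifestyle chunk.
r2_small = ols_r2(list(zip(age, female)), ln_biomarker)
r2_full = ols_r2(list(zip(age, female, supplement, smoker)), ln_biomarker)
r2_increase = r2_full - r2_small
print(f"R2 small={r2_small:.3f} full={r2_full:.3f} increase={r2_increase:.3f}")
```

Because the smaller model is nested in the larger one, the increase can never be negative; its size indicates how much additional variability the added chunk explains as a whole, without testing covariates one at a time.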

A factor that limited the number of

Statistical analyses were carried out using SAS for Windows software version 9.2 (SAS Institute, Cary, NC) and SAS-callable SUDAAN (SUDAAN Release 10.0, 2008 RTI, Research Triangle Park, NC) to account for the unequal probability of inclusion, stratification, and clustering. SUDAAN offers Taylor series linearization to account for the effect of stratification and clustering on the variance estimates. The weights used depended on whether the specimens tested constituted a full or a subsample of all the eligible participants examined at the MEC and how many survey periods were combined to produce the estimate (

We used a row-labeled plot to illustrate the increase in R^{2}

To illustrate how controlling for more covariates selectively affects various biomarkers, we plotted the change in the

Among adults ≥20 y in the non-institutionalized, civilian US population in 2003–2006, 23% were ≥60 y, 52% were female, 72% were non-Hispanic white, 56% had more than a high school education, 43% were considered high income based on PIR, 29% had evidence of current smoking, 29% reported not having any alcohol consumption during the past year or ever, 54% reported taking dietary supplements, 33% were considered obese, and 32% reported no leisure time physical activities during the past 30 d that lasted more than 10 min (

It is interesting to note some patterns across the analytes when comparing the change in R^{2}

To illustrate how controlling for more covariates affects

To provide information about broad patterns across the biomarkers for each of the covariates in the full regression model, a summary of the estimated adjusted changes is presented (

In developing a regression plan to assess the joint impact of 10 specific sociodemographic and lifestyle covariates for each of 29 biomarkers from NHANES 2003–2006, we tried to avoid some statistical practices that have been shown to capitalize on random variation, such as repeated significance testing, data-driven selection of optimal cut points for quantitative variables, automatic model selection approaches, and using the same data more than once to develop a regression model. Derksen and Keselman (^{2}^{2}^{2}

This problem in model building has long been recognized and stems from the fact that we use the same data twice. Chatfield (

A key factor in determining the accuracy of the estimate, both its bias and its variance, is how well confounding has been controlled for in the model (

While one of the primary advantages of limiting data-driven decisions during the model building process is the preservation of the statistical properties of

When analyzing data from observational studies, there are many legitimate reasons to explore and examine data in advance of creating a regression model, such as error checking to confirm the integrity of the data, assessing the size of the sample and types of variables, checking for possible influential observations or the degree of missing data, or using graphical methods to assess basic statistical assumptions (

In an effort to compare the effects of a fixed set of covariates across all biomarkers presented in the

No specific sources of financial support. The findings and conclusions in this report are those of the authors and do not necessarily represent the official views or positions of the Centers for Disease Control and Prevention/Agency for Toxic Substances and Disease Registry or the Department of Health and Human Services.

Author disclosures: M.R. Sternberg, R.L. Schleicher, and C.M. Pfeiffer, no conflicts of interest.

Abbreviations used: MA, Mexican American; MET, metabolic equivalent task; NHB, non-Hispanic black; NHW, non-Hispanic white; PIR, poverty income ratio

The authors acknowledge contributions from the following individuals: Bridgette Haynes and Yi Pan. C.M.P., M.R.S., and R.L.S. designed the overall research project; M.R.S., C.M.P., R.L.S., and M.E.R. conducted most of the research; M.R.S. analyzed most of the data and wrote the initial draft, which was modified after feedback from all coauthors, and had primary responsibility for content. All authors read and approved the final manuscript.

Increase in R^{2}

Sorted in ascending order based on model 2 R^{2}

25OHD, 25-hydroxyvitamin D; 4PA, 4-pyridoxic acid; B, whole blood; B-12, total cobalamin; BI, body iron; CAR, carotenes [sum of

Relative change in

A: non-Hispanic black vs. non-Hispanic white; B: Mexican American vs. non-Hispanic white; C: 1 y increase in age; D: females vs. males.

In each panel,

Sorted by class of biomarker (water-soluble, fat-soluble, phytoestrogens, iodine, hemoglobin adducts of acrylamide); arrows point in the direction of the change of the

4PA, 4-pyridoxic acid; B, whole blood; B-12, total cobalamin; CAR, carotenes [sum of

Biomarkers of diet and nutrition assessed in the adult US population ≥20 y during all or part of NHANES 2003–2006

| Class | Biomarkers | Survey | Population | Sample |
|---|---|---|---|---|
| Water-soluble | FOL (S), FOL (RBC), B-12 | 2003–2006 | ≥20 y | Full |
| | MMA (P) | 2003–2004 | ≥20 y | Full |
| | PLP (S), 4PA (S) | 2005–2006 | ≥20 y | Full |
| Fat-soluble | VIA (S), VIE (S), | 2005–2006 | ≥20 y | Full |
| | 25OHD (S) | 2003–2006 | ≥20 y | Full |
| | SFA (P), MUFA (P), | 2003–2004 | ≥20 y | Fasted subsample |
| Trace elements | FER (S), sTfR (S), BI (S) | 2003–2006 | Women 20–49 y | Full |
| | uI (U) | 2003–2006 | ≥20 y | 1/3 Subsample |
| Phytoestrogens | GEN (U), DAZ (U), EQU (U), | 2003–2006 | ≥20 y | 1/3 Subsample |
| Acrylamide | HbAA (B), HbGA (B) | 2003–2004 | ≥20 y | Full |

25OHD, 25-hydroxyvitamin D; 4PA, 4-pyridoxic acid; B-12, total cobalamin; BI, body iron; CAR, carotenes [sum of

B, whole blood; P, plasma; S, serum; U, urine

Mathematical forms of selected covariates for regression models

| Chunk | Variable | Type | Transformation |
|---|---|---|---|
| Sociodemographic | Age, y | Continuous | None |
| | Sex | Categorical (2 levels) | N/A |
| | Race-ethnicity | Categorical (5 levels) | N/A |
| | Poverty income ratio | Continuous | None |
| | Education | Categorical (2 levels) | N/A |
| Lifestyle | Smoking status | Categorical (2 levels) | N/A |
| | Alcohol consumption | Continuous | ln(x + 1) |
| | Supplement use | Categorical (2 levels) | N/A |
| | BMI (kg/m^{2}) | Continuous | ln |
| | Physical activity | Continuous | ln(x + 1) |

N/A, not applicable

Alcohol consumption: calculated as average daily number of “standard” drinks [(quantity × frequency) / 365.25]; 1 drink ≈ 15 g ethanol

Physical activity: calculated as total metabolic equivalent task (MET)-min/wk from self-reported leisure time physical activities
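As a minimal sketch of the transformations listed in the table above, the following Python snippet applies ln to BMI and ln(x + 1) to alcohol consumption and physical activity. The function name and dictionary keys are hypothetical and chosen only for illustration.

```python
import math


def transform_covariates(bmi, alcohol_drinks_per_day, activity_met_min_per_wk):
    """Apply the table's transformations to the three skewed covariates.

    BMI is strictly positive, so a plain natural log is used. Alcohol
    consumption and physical activity can be exactly 0 (nondrinkers,
    no reported activity), so 1 is added before taking the log.
    """
    return {
        "ln_bmi": math.log(bmi),
        "ln_alcohol": math.log(alcohol_drinks_per_day + 1),
        "ln_activity": math.log(activity_met_min_per_wk + 1),
    }


# Hypothetical person: BMI 27.4, nondrinker, 600 MET-min/wk of activity.
print(transform_covariates(27.4, 0.0, 600.0))
```

Adding 1 before taking the log maps a value of 0 to 0 on the transformed scale, so persons with no alcohol consumption or no reported activity can stay in the model.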

Descriptive information for the adult US population ≥20 y by sociodemographic and lifestyle factors, NHANES 2003–2006

| Factor | Category | Estimate, % |
|---|---|---|
| Age, y | 20–39 | 38.4 |
| | 40–59 | 38.8 |
| | ≥60 | 22.8 |
| Sex | Male | 48 |
| | Female | 52 |
| Race-ethnicity | Mexican American | 7.9 |
| | Non-Hispanic black | 11.4 |
| | Non-Hispanic white | 72 |
| | Other Hispanic | 3.5 |
| | Other (including multiracial) | 5.4 |
| Education | ≤High school | 44.2 |
| | >High school | 55.9 |
| PIR | Low | 29.3 |
| | Middle | 28 |
| | High | 42.7 |
| Smoking status | No | 71.2 |
| | Yes | 28.9 |
| Alcohol consumption, drinks/d | No drinks | 29.4 |
| | <1 (not 0) | 56.8 |
| | 1–<2 | 7.9 |
| | ≥2 | 6.0 |
| Supplement use | No | 45.9 |
| | Yes | 54.1 |
| BMI | Underweight | 1.8 |
| | Normal | 31.6 |
| | Overweight | 33.4 |
| | Obese | 33.3 |
| Physical activity, MET-min/wk | None reported | 32.1 |
| | 0–<500 | 24.2 |
| | 500–<1000 | 14.0 |
| | ≥1000 | 29.7 |

Values represent weighted percentage using 4 y mobile examination center weights from NHANES 2003–2006

PIR, family poverty income ratio; low: 0–1.85; middle: >1.85–3.5; high: >3.5

“Smoker” defined by serum cotinine concentration >10 µg/L

Alcohol consumption: calculated as average daily number of “standard” drinks [(quantity × frequency) / 365.25]; 1 drink ≈ 15 g ethanol

“Supplement user” defined as participant who reported taking a dietary supplement within the past 30 d

BMI (kg/m^{2}) definitions: underweight: <18.5; normal weight: 18.5–<25; overweight: 25–<30; and obese: ≥30

Physical activity: calculated as total metabolic equivalent task (MET)-min/wk from self-reported leisure time physical activities
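The descriptive categories for PIR and BMI can be expressed as simple threshold functions. The sketch below uses the cut points stated in the text and table notes (BMI: <18.5, 18.5–<25, 25–<30, ≥30; PIR: 0–1.85, >1.85–3.5, >3.5); the function names and category labels are hypothetical.

```python
def bmi_category(bmi):
    """WHO BMI groups (kg/m^2): <18.5, 18.5-<25, 25-<30, >=30."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def pir_category(pir):
    """PIR groups: low 0-1.85, middle >1.85-3.5, high >3.5."""
    if pir <= 1.85:
        return "low"
    if pir <= 3.5:
        return "middle"
    return "high"


# Hypothetical person with BMI 27.4 and PIR 2.1.
print(bmi_category(27.4), pir_category(2.1))
```

Note that the boundaries are asymmetric: the BMI groups include their lower bound, whereas the PIR groups include their upper bound, following the definitions as written.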

Estimated percent change in biomarker concentration after adjusting for sociodemographic and lifestyle factors using data for adults ≥20 y, NHANES 2003–2006

| Analyte | Age | Sex | Race-ethnicity | Race-ethnicity | Education | PIR | Supplement use | Smoking | Alcohol | BMI | Physical activity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FOL (S) | 6.8 | 5.8 | −13.0 | −5.8 | −0.1 | −1.3 | 38.4 | −14.9 | −2.4 | −4.1 | 1.4 |
| FOL (RBC) | 4.5 | 3.6 | −19.7 | −4.7 | −0.7 | −0.6 | 24.1 | −12.2 | 1.6 | 3.8 | 0.6 |
| PLP (S) | −2.1 | −21.2 | −7.7 | 8.1 | −2.3 | −8.3 | 78.7 | −27.6 | 10.6 | −12.6 | 3.1 |
| 4PA (S) | 14.6 | −8.6 | −23.5 | −13.1 | −0.6 | −6.3 | 104 | −18.1 | −0.3 | −7.4 | 1.3 |
| B-12 (S) | 1.0 | −3.7 | 20.2 | 15.4 | 3.1 | −0.9 | 20.8 | −6.2 | −3.3 | −4.3 | 0.6 |
| tHcy (P) | 9.6 | −15.0 | 0.3 | −10.6 | −0.4 | 1.9 | −8.4 | 7.8 | 3.2 | −0.1 | −0.4 |
| MMA (P) | 9.2 | −5.2 | −22.4 | −21.6 | −1.8 | 2.7 | −12.1 | −0.9 | −1.9 | −1.7 | −1.4 |
| VIC (S) | 1.9 | 5.7 | 3.4 | 5.3 | −1.8 | −1.7 | 16.2 | −11.0 | −0.8 | −5.3 | 1.4 |
| VIA (S) | 2.1 | −9.6 | −9.4 | −8.7 | −1.2 | −0.7 | 5.4 | −0.3 | 6.1 | −1.4 | 0.6 |
| VIE (S) | 5.0 | −1.2 | −6.6 | 5.2 | −2.8 | −2.3 | 20.9 | −4.8 | −0.4 | −0.9 | 0.5 |
| CAR (S) | −2.8 | 2.7 | 9.1 | −1.4 | −9.3 | −3.7 | 11.7 | −17.3 | 0.1 | −10.3 | 2.8 |
| XAN (S) | 2.5 | −3.0 | 33.1 | 57.2 | −7.9 | −3.4 | 6.1 | −24.8 | −1.6 | −14.8 | 2.5 |
| 25OHD (S) | −0.8 | 0.1 | −23.6 | −12.0 | 1.6 | −0.6 | 5.2 | −1.5 | 1.6 | −4.3 | 1.5 |
| SFA (P) | 0.4 | 2.0 | −6.3 | 8.4 | 2.2 | 0.4 | 3.5 | 1.5 | 4.8 | 4.4 | −0.3 |
| MUFA (P) | 2.4 | −1.2 | −15.0 | 10.0 | 0.9 | 3.7 | 2.0 | 5.4 | 5.5 | 4.2 | −0.5 |
| PUFA (P) | −0.1 | 1.8 | 0.4 | 8.9 | 1.7 | −1.6 | 1.3 | −2.4 | −0.6 | 0.3 | 0.3 |
| tFA (P) | 0.7 | 0.6 | −5.3 | 10.1 | 0.6 | 0.4 | 2.2 | 1.2 | 3.1 | 2.0 | 0.2 |
| FER (S) | 7.6 | N/A | −6.6 | −3.9 | −5.0 | −4.2 | 6.0 | 24.2 | 21.3 | 7.2 | 2.9 |
| sTfR (S) | 1.5 | N/A | 18.8 | −3.9 | −2.6 | 3.4 | −1.1 | −10.7 | −5.9 | 6.0 | −0.7 |
| BI (S) | 0.2 | N/A | −0.8 | 0.0 | −0.1 | −0.3 | 0.3 | 1.1 | 0.9 | 0.0 | 0.1 |
| uI (U) | 11.4 | 3.8 | −33.7 | 3.7 | 4.6 | 3.4 | 22.1 | −6.6 | −7.1 | −0.5 | 0.1 |
| EQU (U) | 0.9 | 19.5 | −41.2 | −35.7 | 1.1 | −7.9 | 13.6 | −6.5 | −17.8 | −2.5 | 4.3 |
| DMA (U) | 7.7 | 28.0 | −17.5 | −55.6 | −9.3 | −14.6 | 12.0 | −28.3 | −20.2 | −3.1 | 2.3 |
| ETD (U) | 8.7 | 39.7 | −24.1 | 8.9 | −15.6 | −16.5 | 8.6 | −10.5 | 12.9 | −10.8 | 2.9 |
| ETL (U) | 12.4 | 15.5 | −17.6 | 24.4 | −6.5 | −13.7 | 1.0 | −32.3 | −5.4 | −21.1 | 6.5 |
| DAZ (U) | 9.2 | 10.4 | −15.0 | −16.6 | −9.5 | −3.8 | 3.3 | −12.8 | −6.6 | −3.4 | 0.9 |
| GNS (U) | 9.5 | 13.6 | −22.8 | −4.4 | −7.6 | −2.1 | 10.3 | −8.9 | −2.4 | −8.1 | 1.9 |
| HbAA (B) | −2.9 | 3.6 | 0.1 | 4.0 | 5.9 | −0.1 | −3.5 | 126 | 2.4 | −4.8 | 0.6 |
| HbGA (B) | −5.7 | 8.7 | 16.9 | 4.7 | 7.0 | −2.0 | −2.5 | 101 | −11.8 | 2.6 | 0.2 |

Change represents percent change (%) in geometric mean for all biomarkers except vitamin C (µmol/L), 25-hydroxyvitamin D (µmol/L), and body iron (mg/kg), for which change in arithmetic mean is expressed in concentration units; the change in each covariate was estimated while holding all other variables in the model constant; 25OHD, 25-hydroxyvitamin D; 4PA, 4-pyridoxic acid; B-12, total cobalamin; BI, body iron; CAR, carotenes [sum of

Iron status indicators (FER, sTfR and BI) were only measured in women of reproductive age, thus our analysis was limited to women 20–49 y of age

HbAA, HbGA, MMA, SFA, MUFA, PUFA, and tFA, data only available for NHANES 2003–2004; 4PA, CAR, PLP, VIA, VIE, and XAN data only available for NHANES 2005–2006

B, whole blood; P, plasma; S, serum; U, urine

F, female; M, male

MA, Mexican American; NHB, non-Hispanic black; NHW, non-Hispanic white

HS, high school

PIR, family poverty income ratio

“Supplement user” defined as participant who reported taking a dietary supplement within the past 30 d

“Smoker” defined by serum cotinine concentration >10 µg/L

Alcohol consumption: calculated as average daily number of “standard” drinks [(quantity × frequency) / 365.25]; 1 drink ≈ 15 g ethanol

A 25% increase in BMI is comparable to a change from being normal weight to overweight

Physical activity: calculated as total metabolic equivalent task (MET)-min/wk from self-reported leisure time physical activities

Model includes total cholesterol and lipid altering prescription drug use

N/A, not applicable because data were only available for women

Model includes urine creatinine

Change is significantly different from 0;
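For readers who want to see how a regression coefficient can be turned into a percent change in a geometric mean, such as those reported in the table above, the following Python sketch shows the standard conversions. It assumes the biomarker was modeled on the natural-log scale; the function names and the coefficient values are hypothetical and are not taken from the fitted models.

```python
import math


def pct_change_untransformed(beta, delta):
    """Percent change in geometric mean for a `delta`-unit increase in an
    untransformed covariate, given a ln-scale response: (e^(beta*delta) - 1) * 100."""
    return (math.exp(beta * delta) - 1.0) * 100.0


def pct_change_ln_covariate(beta, ratio):
    """Percent change in geometric mean when a ln-transformed covariate is
    multiplied by `ratio` (e.g. 1.25 for a 25% increase): (ratio^beta - 1) * 100."""
    return (ratio ** beta - 1.0) * 100.0


# Hypothetical coefficients, for illustration only.
print(round(pct_change_untransformed(0.005, 10), 2))  # 10-unit increase
print(round(pct_change_ln_covariate(-0.2, 1.25), 2))  # 25% increase
```

Covariates entered as ln(x + 1) do not reduce to a simple ratio rule, so a change for those covariates would need to be evaluated between two specific values of x.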