In industrial hygiene, a worker’s exposure to chemical, physical,
and biological agents is increasingly being modeled using deterministic physical
models that study exposures near and farther away from a contaminant source.
However, predicting exposure in the workplace is challenging and simply
regressing on a physical model may prove ineffective due to biases and
extraneous variability. A further complication is that data from the workplace
are usually

A key concern of industrial hygiene is the estimation of a worker’s
exposure to chemical, physical, and biological agents. One goal of exposure modeling
is to represent the physical processes generating chemical concentrations in the
workplace. Physical models in industrial hygiene include a

Inference is improved by using observations from the workplace. Concentration is typically measured over a finite set of timepoints. The two-zone setting produces bivariate concentration measurements, one from the “near field” and another from the “far field.” Typically, however, there are discrepancies between the observations and the deterministic physical model because the physical model's assumptions are violated in real workplace environments. Plausible inputs to the two-zone model could, perhaps, be obtained by tuning them through trial and error until the output agrees satisfactorily with the concentration measurements. That approach, however, is unattractive. Finding satisfactory agreement between the observations and the physical model’s output can be difficult, and even when they agree, the approach fails to account for the uncertainty in estimation and prediction. Model assessment would be completely ad hoc as well. A more principled approach estimates the physical model’s unknown inputs from the concentration measurements while making use of prior information on the input parameters. Usually, some prior information regarding the inputs to the physical model is available, either from physical considerations implied by the model or from experts with experience in workplace environments. A Bayesian modeling framework that allows synthesis of information from different sources is, therefore, attractive.

Synthesizing deterministic physical models with statistical models to achieve
improved inference continues to garner attention. One approach, Bayesian melding
(e.g.,

Stochastic processes are deployed to reckon with variability not accounted
for by the physical model.

Apart from addressing a new domain of application and the theoretical
implications therein, we specifically focus upon two pertinent statistical issues.
First, we deal with associated bivariate outcomes that are not only related by the
physical model, but are also likely to produce correlated residuals. We recognize
that even when the physical model is easily tractable, either analytically or
computationally, it is unable to account for extraneous variability in the
workplace. This occurs almost invariably in industrial hygiene
experiments—the physical model provides information on the overall trend but
is too inflexible to capture variation at smaller scales, thereby impairing
predictive performance. Second, the data from industrial workplaces are, more often
than not,

Statistical modeling for temporal processes can proceed either by treating
time as “discrete” or as “continuous” depending upon
whether inference (e.g., prediction or interpolation) is sought at the same temporal
resolution (e.g., “minutes” and “hours”) at which the
concentrations have been observed or whether it is sought at arbitrary resolutions.
Here, we treat the concentrations as smooth functions of time and offer inference at
arbitrary temporal resolutions. In this regard, our approach is arguably richer and
especially attractive for handling temporal misalignment. Our key modeling
ingredient is a multivariate Gaussian process. Apart from modeling the usual
residual variability, our framework achieves the following analytical objectives:
(i) approximate the trend (or bias) missed by the physical model for concentrations
in both fields, (ii) capture correlations across time (with process realizations
acting as time-varying random effects), and (iii) model the correlations among the
outcomes when we have multiple outcomes. These objectives resemble those in the
“calibration” of multi-output computer models, where Gaussian
process emulators for the physical model are used to estimate the inputs (see,
e.g.,

The remainder of the article evolves as follows. In

The two-zone (or two-component) model assumes the presence of a contamination
source in the workplace and that the region is composed of two well-mixed zones or
fields. The zone very near and around the source is called the near field, and the
zone beyond it is the far field. Here V_{N} and V_{F} denote the volumes of the
near and far fields, respectively.

In this context, the hygienist models the exposure concentrations at the
near and far fields based upon observations collected over a period of time. _{N}
(_{1}; _{F}
(_{1};

The solution of _{1};
_{1}; _{N}
(_{1}; _{F}
(_{1}; _{1} and λ_{2} are the
eigenvalues of _{1};
_{F}, and
_{N} are all strictly positive
(see

The exponential terms in

The experimental two-zone data that we analyze here are part of a database
compiled from a series of designed experiments that were conducted in the industrial
hygiene laboratories at the University of Minnesota. The data consist of exposure
concentrations of toluene over a period of time, where the ventilation rate and the
contaminant generation rate were 13.8 m^{3}/min and 351.5 mg/min,
respectively. Measurements at 10 cm and 15 cm from the contamination source
represent the exposure concentrations in the near and far fields, respectively. The
near field is defined to be a 10 cm high cylinder with a radius of 10 cm around the
generation source. Consequently, V_{N} = π × 10^{−3} m^{3}
(approximately 3.14 × 10^{−3} m^{3}). The
zone beyond the near field is the far field, which has
V_{F} = 3.8 m^{3}.
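To make the role of these quantities concrete, the following Python sketch evaluates near- and far-field concentrations under the standard two-zone mass-balance equations, V_N dC_N/dt = G + β(C_F − C_N) and V_F dC_F/dt = β(C_N − C_F) − Q C_F. This form of the equations, the symbol names, the function name `two_zone`, and the interzonal airflow `beta = 5.0` m^{3}/min are assumptions made for illustration only; the ventilation rate, generation rate, and zone volumes are the experimental values quoted above.

```python
import math

def two_zone(t, beta, Q, G, V_N, V_F, c0=(0.0, 0.0)):
    """Return (C_N(t), C_F(t)) for the assumed two-zone linear ODE system."""
    # Coefficient matrix A of dC/dt = A C + g, with g = (G / V_N, 0).
    a11, a12 = -beta / V_N, beta / V_N
    a21, a22 = beta / V_F, -(beta + Q) / V_F
    # Steady state solves A C + g = 0.
    css = (G / Q + G / beta, G / Q)
    # Eigenvalues of the 2x2 matrix A (real and negative for positive inputs).
    tr, det = a11 + a22, a11 * a22 - a12 * a21
    root = math.sqrt(tr * tr - 4.0 * det)
    lams = ((tr + root) / 2.0, (tr - root) / 2.0)
    # An eigenvector for eigenvalue lam is (a12, lam - a11).
    v = [(a12, lam - a11) for lam in lams]
    # Solve c1*v1 + c2*v2 = C(0) - C_ss by Cramer's rule.
    d0, d1 = c0[0] - css[0], c0[1] - css[1]
    den = v[0][0] * v[1][1] - v[1][0] * v[0][1]
    c1 = (d0 * v[1][1] - v[1][0] * d1) / den
    c2 = (v[0][0] * d1 - d0 * v[0][1]) / den
    e1, e2 = math.exp(lams[0] * t), math.exp(lams[1] * t)
    cn = css[0] + c1 * v[0][0] * e1 + c2 * v[1][0] * e2
    cf = css[1] + c1 * v[0][1] * e1 + c2 * v[1][1] * e2
    return cn, cf

# Experimental settings from the text; beta is a hypothetical value.
Q, G, V_N, V_F, beta = 13.8, 351.5, math.pi * 1e-3, 3.8, 5.0
for minutes in (0.0, 1.0, 5.0, 1000.0):
    print(minutes, two_zone(minutes, beta, Q, G, V_N, V_F))
```

Under these assumed equations the far-field steady state is G/Q regardless of the value chosen for `beta`, while the near-field steady state exceeds it by G/β.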

A salient feature of these data, and one that is not atypical in industrial hygiene, is that measurements at several timepoints are available from only one of the fields rather than from both simultaneously. Given limited resources and other logistics of setting up the experiment, observations are initially available only from the near field. As the experiment proceeds, we obtain measurements from both fields. Because taking simultaneous measurements from both fields may be logistically difficult, only the far field is measured toward the end of the experiment, to make up for the initial loss of information there.
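The misalignment pattern just described can be made explicit by sorting timepoints according to which field was measured. The sketch below assumes a hypothetical record layout, `(time, near, far)` with `None` marking a field that was not measured; the function name and the numbers are invented for illustration and are not the experimental data.

```python
def partition_timepoints(records):
    """Split timepoints into near-only, far-only, and jointly observed sets."""
    near_only, far_only, both = [], [], []
    for t, near, far in records:
        if near is not None and far is not None:
            both.append(t)          # both fields measured at this time
        elif near is not None:
            near_only.append(t)     # early phase: near field only
        elif far is not None:
            far_only.append(t)      # late phase: far field only
        # records with neither field measured carry no information and are skipped
    return near_only, far_only, both

# Hypothetical records (minutes, near-field reading, far-field reading).
records = [
    (1, 40.2, None),
    (2, 55.1, None),
    (3, 61.7, 12.4),
    (4, 66.0, 15.9),
    (5, None, 18.3),
]
near_only, far_only, both = partition_timepoints(records)
print(near_only, both, far_only)
```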

A brief exploratory analysis of the data reveals why relying upon the
physical model alone for inference and scientific deductions is undesirable. Under
the assumption of zero initial concentration, the theoretical implication of the
two-zone model is that the concentration in the far field attains steady state after
about 351.5/13.8 ≈ 25 mg/m^{3}. Even a cursory glance at ^{3}.
Least-squares analysis and other methods that purely rely upon regressing on the
physical model (e.g.,

We elucidate our approach using a generic setup that considers the following
distinct modeling ingredients: (a) an _{1}(_{m}(^{T}
taken at timepoint _{1}, in the physical model
that are unknown, and (c) variables _{1} = {_{N}_{F}}, and

Following recent research (see, e.g., _{i}(_{1};
_{i}(_{i}(_{i}
(_{1};
_{i}(_{1};
_{i} (_{i} (_{1};
_{1},

The critical element in _{1}, …,
_{n}} and that
_{m}
(_{m},
_{η}(_{2};
·, ·)) denotes a zero-centered _{η}(_{2};
_{i} (_{j}_{2} is a collection of
unknown parameters therein. The Gaussian process implies that
^{T}(_{1}),
…,
^{T}(_{n}^{T}
is distributed as an _{η}_{2};
_{mn}(_{mn},
∑_{η}_{2};
_{η}(_{2};
_{η}(_{2};
_{k},
_{l}

Clearly, care is needed when choosing
_{η}_{2};
·, ·) so that
_{η}_{2};
_{p}(_{p},
_{w}_{1}(_{p}(^{T}
is the

Now, assume _{p}. Accordingly,
_{w}_{i}_{i}_{i}_{i}_{i}_{i}_{1},_{p}}. Regardless of how
close _{i}(_{j}_{η}_{2};
_{2}} =
_{w}^{T},
where _{2} = {

If _{η}_{2};
_{i}_{j}_{η}_{2};
^{T}, which
means that we can, without loss of generality, set _{η}_{2};
_{η}(_{2};
_{mn}(_{mn},
Σ_{η}(_{2};
_{η}(_{2};
_{n} ⊗
_{w}(_{n} ⊗
^{T}) is guaranteed to be symmetric and positive
definite as long as _{w}(

It remains, then, to choose
_{1}(_{1};
·, ·),…,
_{p}(_{p};
·, ·). These will control the smoothness of the underlying
process. Had the process been an emulator for the physical model, as is often
the case for complex computer models (e.g., _{i}(_{i}^{−φi∣t–t′∣2}.
We, however, use the process to model time-varying random effects representing
unaccounted structured extraneous variation in the data. Excessive smoothness
will lead to poorer fits and is not desirable. For flexibly modeling smoothness
as well as strength of association, we opt for the Matérn correlation function
_{i}(_{i};
·, ·)’s are Matérn functions with distinct
parameters. Specifically, let _{i}_{i},
_{i}} be the Matérn
parameters in _{i}_{i};
·, ·), _{2} =
{_{1},…,
_{p}_{1},…,
_{p}}. Several simpler
choices emerge as special cases, most notably the exponential
e^{−ϕ∣t–t′∣},
which results by fixing the Matérn smoothness parameter at 1/2.
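As a rough illustration of these choices, the sketch below codes the Matérn correlation in closed form for three half-integer smoothness values and assembles a cross-covariance of the form A diag(ρ_1, …, ρ_p) A^T. The parameterization (2^{1−ν}/Γ(ν)) (φd)^ν K_ν(φd), the symbol ν for smoothness, the function names, and the matrix `A` used in the example are assumptions for illustration; the article's own parameterization may differ.

```python
import math

def matern(d, phi, nu):
    """Matern correlation at lag d for nu in {0.5, 1.5, 2.5} (closed forms)."""
    x = phi * abs(d)
    if nu == 0.5:
        poly = 1.0                      # reduces to the exponential exp(-phi*|d|)
    elif nu == 1.5:
        poly = 1.0 + x
    elif nu == 2.5:
        poly = 1.0 + x + x * x / 3.0
    else:
        raise ValueError("this sketch only covers nu in {0.5, 1.5, 2.5}")
    return poly * math.exp(-x)

def cross_cov(A, phis, nus, t, s):
    """Return A diag(rho_1(t, s), ..., rho_p(t, s)) A^T for an m x p matrix A."""
    rho = [matern(t - s, phi, nu) for phi, nu in zip(phis, nus)]
    m, p = len(A), len(rho)
    return [[sum(A[i][k] * rho[k] * A[j][k] for k in range(p))
             for j in range(m)] for i in range(m)]

# Hypothetical lower-triangular A with two latent processes.
A = [[1.0, 0.0],
     [0.5, 2.0]]
print(cross_cov(A, phis=[0.7, 0.2], nus=[0.5, 2.5], t=3.0, s=1.0))
```

At zero lag every correlation equals one, so the cross-covariance collapses to A A^T; larger smoothness values give correlations that decay more gently near the origin.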

Turning to the measurement error process at any time-point
_{3} is the collection of the
_{ϵ}(_{3}),
where
var{_{j}(_{j}_{i}_{j}_{ij} is the
(

For _{i}_{i}) ∣
_{i}),
_{ϵ}_{3})),
where _{1},
_{2},
_{3}} (recall
_{1} =
{_{k}
(_{i}’s
have independent prior distributions, that is, _{i} is
the set of hyperparameters related to the prior distribution of
_{i}.

Estimation of

We will subsequently use the deviance information criterion (DIC) and a
modified predictive model choice criterion called the Gneiting–Raftery
scoring rule (GRS) as model comparison metrics. Let _{i} =
_{1};
_{i}_{i}_{ϵ}_{3}).
These parameters constitute the “focus” of the DIC.
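For readers who want to see how such metrics are computed from MCMC output, here is a minimal Python sketch. It uses the usual definition of the DIC (posterior mean deviance plus the effective number of parameters) and the Gaussian form of the Gneiting–Raftery score based on predictive means and variances of replicated data. The function names and inputs are hypothetical, and the modified criterion used in this article may differ in detail.

```python
import math

def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from posterior deviance draws."""
    d_bar = sum(deviance_draws) / len(deviance_draws)   # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean            # effective number of parameters
    return d_bar + p_d, p_d

def grs(y, mu_rep, var_rep):
    """Gaussian-form scoring rule; larger values indicate better prediction."""
    return sum(-((yi - m) ** 2) / v - math.log(v)
               for yi, m, v in zip(y, mu_rep, var_rep))

# Toy numbers, for illustration only.
print(dic([10.0, 12.0, 14.0], 11.0))
print(grs([1.0, 2.0], [0.0, 2.0], [1.0, 1.0]))
```

Lower DIC values and higher GRS values favor a model under these conventions.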

_{j}_{i}^{rep}(_{i}^{rep} be the ^{rep}(_{i}^{rep} is

Each ^{rep}(_{i}_{i}^{rep} from
^{rep} ∣

_{i}. In practice, however,
it is not uncommon to encounter

An advantage of our process-based framework is that inference with
misaligned data can be accommodated with some minor tweaks. We elucidate with
the two-zone model (_{1}(_{2}(^{T}, and
_{1}(_{2}(_{1} be the set of timepoints that yield observations
only in the near field, _{2} be the time-points that yield
observations only in the far field, and _{12} be the
timepoints yielding simultaneous measurements from both the fields. The observed
data likelihood is now

For a more generic setup, some further details on implementation may be
useful. Let _{i}_{o} and _{m} the observed and
missing data, respectively. We can write _{o} and
_{m} by suitably extracting elements from
_{o} and _{m}, such that
_{o} = _{o}_{m} = _{m}_{o} is _{m} is (

Bayesian inference evaluates the full posterior predictive distribution
_{m},

Obtaining samples from _{o}): for each sampled
_{m} from
_{m} ∣
_{o}).
Matters are simplified because _{m} ∣
_{o}
~
_{nm–k}
(

Here,
_{y}_{2},
_{3}) =
∑_{η}(_{2};
_{n} ⊗
_{ϵ}_{3})
is the _{o} and
_{m} have full row rank and
_{y}_{2},
_{3}) is nonsingular.
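The conditioning step can be illustrated in its simplest case: one observed value and one missing value that are jointly normal. The sketch below applies the standard multivariate normal conditioning formulas, mean μ_m + Σ_mo Σ_oo^{−1}(y_o − μ_o) and variance Σ_mm − Σ_mo Σ_oo^{−1} Σ_om, to scalars. The function names and example numbers are hypothetical; in the full model the same formulas are applied with vectors and matrices for each posterior draw of the parameters.

```python
import math
import random

def conditional_normal(mu_o, mu_m, s_oo, s_mo, s_mm, y_o):
    """Mean and variance of the missing value given the observed value."""
    mean = mu_m + (s_mo / s_oo) * (y_o - mu_o)
    var = s_mm - s_mo ** 2 / s_oo
    return mean, var

def draw_missing(mu_o, mu_m, s_oo, s_mo, s_mm, y_o, rng):
    """One posterior predictive draw of the missing value for fixed parameters."""
    mean, var = conditional_normal(mu_o, mu_m, s_oo, s_mo, s_mm, y_o)
    return rng.gauss(mean, math.sqrt(var))

# Hypothetical values: observed near-field reading, missing far-field reading.
rng = random.Random(0)
print(conditional_normal(0.0, 0.0, 1.0, 0.5, 1.0, 2.0))
print(draw_missing(0.0, 0.0, 1.0, 0.5, 1.0, 2.0, rng))
```

Repeating `draw_missing` once per posterior sample of the parameters yields draws from the posterior predictive distribution of the missing value.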

We now apply our PBBM approach to datasets simulated from two-zone
experiments as well as the experimental data described in

Specifications for _{w}(_{1}, _{1}}.
For D and LT, _{1}, _{2},
_{1}, _{2}}.
Finally,

We first compare the performance of the models using synthetic two-zone
datasets that were generated according to the PBBM and BNLR frameworks.
Specifically, we simulate 100 independent datasets from the BNLR model and from
each of the three PBBM specifications in

As seen in _{i}’s and
_{i}_{i}’s can produce
more variable concentration curves that would be more congruous to models with
random effects. The input parameters for the two-zone model (i.e.,
⊥_{1} = {_{N}_{F}}) were taken from physical
considerations deemed plausible by industrial hygienists (e.g., _{N} =
1.1 m^{3}. Moreover, we assume
_{F} = 240 m^{3} and zero
initial concentrations in both fields. In this scenario, the theoretical
steady-state concentration at the near field (^{3}) is roughly three
times higher than that at the far field (^{3}). (Information regarding the prior settings are available
in the

We divide each simulated dataset into a training set and a test set. The training set consists of exposure concentrations in both fields at 70 timepoints randomly selected between 1 and 100 min. The test set is composed of the exposure concentrations at the remaining timepoints. For each model, inference was based on 5000 posterior samples obtained from our MCMC algorithm after discarding the first 5000 iterations as burn-in. For random-walk Metropolis steps, we transformed parameters, where necessary, to have support on the real line so that normal proposals could be used, and then transformed them back to the original scale. For the Gaussian process covariance functions, the substantive inferences from the Matérn and the exponential were essentially indistinguishable. We therefore present only the results for the Matérn.
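The transformation device for random-walk Metropolis steps can be sketched as follows for a single positive parameter. The target here is a stand-in Gamma(3, 2) density, not the posterior of any parameter in our models, and the function names, step size, and seed are hypothetical. Sampling is done on z = log(x) with normal proposals, and the log Jacobian of the transformation is included in the acceptance ratio.

```python
import math
import random

def log_density_z(z, shape=3.0, rate=2.0):
    """Log density of z = log(x) when x has an unnormalized Gamma(shape, rate) density.

    log p(x) = (shape - 1) * log(x) - rate * x, and the log Jacobian of x = exp(z) is z,
    so the density on the z scale is shape * z - rate * exp(z) up to a constant.
    """
    return shape * z - rate * math.exp(z)

def rw_metropolis_log_scale(n_iter, step=1.0, seed=1):
    """Random-walk Metropolis on z = log(x); returns draws on the original scale."""
    rng = random.Random(seed)
    z = 0.0
    draws = []
    for _ in range(n_iter):
        z_new = z + rng.gauss(0.0, step)           # symmetric normal proposal
        log_ratio = log_density_z(z_new) - log_density_z(z)
        # 1 - random() lies in (0, 1], which keeps the logarithm finite.
        if math.log(1.0 - rng.random()) < log_ratio:
            z = z_new
        draws.append(math.exp(z))                  # back-transform to x > 0
    return draws

draws = rw_metropolis_log_scale(10000)
kept = draws[5000:]                                # discard the first 5000 as burn-in
print(sum(kept) / len(kept))
```

The retained draws should have a mean near shape/rate = 1.5, which gives a quick sanity check on the Jacobian term.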

We now analyze the misaligned experimental data described in

_{1} and
_{2} are noticeably higher for the BNLR
than under PBBMs. This is unsurprising because the BNLR attributes the entire
variation in the data to measurement errors, while PBBM attributes part of the
variation to the underlying latent process as well.

We also see a substantial bias in the airflow
(

We proffered a PBBM approach for predicting exposure concentrations over
time in industrial workplaces. We believe our current application to be the first
serious venture of Bayesian melding in the domain of industrial hygiene, a field
that has a strong Bayesian presence in the form of subjective judgment but still
relies largely upon least squares and straightforward Bayesian regressions (BNLR)
(see, e.g.,

The PBBM is applicable whenever full inference on physical parameters and subsequent predictions are sought. We show that the PBBM delivers substantial, sometimes dramatic, improvements in inference over straightforward nonlinear regression. The PBBM approach reflects the variability much better, provides far superior fits to the data, yields better predictions, and, perhaps most importantly from the hygienist’s perspective, provides a much more realistic assessment of the uncertainties involved in the estimation of the model.

Based upon our current findings, we advocate estimating inputs to the
physical model whenever possible. We recognize that full inference here will require
solving the physical model, which may be infeasible in certain settings. However, a
very large number of physical processes can be formulated as general systems of
linear ODEs, whose solutions closely depend upon the eigenvalues of the coefficient
matrix (see “

The rich association structures permissible within PBBM are noteworthy.
Because they enter after regressing on the posited physical model, these structures
can be applied even if a posited physical model were computationally prohibitive. In
such cases, a distinct and smoother Gaussian process on the space of inputs can be
deployed as a fast interpolator or emulator for the physical model (e.g.,

A few extensions are worth noting. Clearly we have only skimmed the surface
in our choice of physical models. The experimental design can be modified to collect
measurements in different spatial locations at each timepoint. Space–time
physical models based upon diffusion principles can then be combined with
spatio-temporal stochastic processes to create highly flexible melding frameworks.
In fact, enrichments such as allowing the inputs to vary over space and time can be
envisioned as well. Future possibilities may also include space–time
dynamical specifications for

Additional statistical analyses for both the simulation study and the experimental data are provided in the file “AdditionalAnalysis.pdf.”

The conditions for the identifiability of process parameters in

The derivation of the solution of (1) is presented in the file “ODE.pdf.”

Dynamics of the two-zone model.

Two-zone experimental data.

Posterior predictive means for the replicated data plotted against the observed log exposure concentrations for the workplace data.

Matrix structures for

(a) V | (b) D | (c) LT |
---|---|---|

Parameter values used to simulate the synthetic two-zone datasets

Parameters | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

_{1} | _{2} | _{3} | ||||||||||||

Model | A | _{1} | _{2} | _{1} | _{2} | _{1} | _{2} | _{3} | _{1} | _{2} | _{12} | |||

PBBM | V | 7.25 | 15 | 105 | 8 | – | 2.5 | – | 0.032 | 0.141 | – | 0.0005 | 0.0100 | 0.0020 |

PBBM | D | 7.25 | 15 | 105 | 15 | 8 | 0.5 | 2.5 | 0.032 | 0.141 | – | 0.0005 | 0.0100 | 0.0020 |

PBBM | LT | 7.25 | 15 | 105 | 15 | 8 | 0.5 | 2.5 | 0.032 | 0.062 | 0.127 | 0.0005 | 0.0100 | 0.0020 |

BNLR | – | 7.25 | 15 | 105 | – | – | – | – | – | – | – | 0.0010 | 0.0200 | 0.0040 |

DIC and GRS metrics for the simulation study assuming no temporal misalignment. The standard errors, from the 100 simulations, are shown in parentheses.

GOF | Model | BNLR | D | LT | V |
---|---|---|---|---|---|
DIC | BNLR | −470.35 (14.68) | −470.38 (14.89) | −470.31 (15.11) | −470.43 (14.82) |
DIC | D | −356.83 (41.36) | −510.78 (18.44) | −507.11 (18.88) | −478.19 (25.45) |
DIC | LT | −459.62 (33.95) | −522.40 (16.39) | −535.27 (16.60) | −522.71 (18.20) |
DIC | V | −513.16 (29.60) | −546.63 (20.22) | −560.89 (17.23) | −560.09 (17.68) |
GRS | BNLR | 619.81 (22.10) | 621.98 (22.53) | 627.36 (22.84) | 623.34 (22.65) |
GRS | D | 606.75 (35.52) | 723.08 (24.23) | 717.94 (28.15) | 661.50 (38.50) |
GRS | LT | 597.21 (39.79) | 657.23 (25.98) | 703.07 (34.45) | 673.43 (37.04) |
GRS | V | 626.01 (48.19) | 698.64 (34.52) | 735.82 (27.51) | 732.71 (29.44) |

Multivariate potential scale reduction factor (

BNLR | D | LT |
---|---|---|

1.07 | 1.01 | 1.01 |

DIC and GRS scores for the actual workplace data

Model | DIC | p_{D} | Mean deviance | GRS |
---|---|---|---|---|
BNLR | 768.56138 | 0.9767 | 767.58468 | −721.9574 |
D | −2857.18172 | 36.46883 | −2893.65056 | 3819.8092 |
LT | −2856.37636 | 37.21186 | −2893.58822 | 3824.98638 |

Posterior summaries for the main parameters in the BNLR and PBBM in the workplace data

Model | Par | Mean | 95% CI | MCSE |
---|---|---|---|---|
BNLR | | 2.059 | (1.999, 2.111) | 0.004 |
BNLR | _{1} | 0.371 | (0.317, 0.434) | 0.002 |
BNLR | _{2} | 4.207 | (3.635, 4.906) | 0.012 |
BNLR | _{12} | 1.249 | (1.079, 1.453) | 0.003 |
D | | 6.437 | (1.127, 12.732) | 0.125 |
D | _{1} | 1.245 | (0.716, 1.926) | 0.018 |
D | _{2} | 1.513 | (0.977, 2.292) | 0.011 |
D | _{1} | 1.2e-04 | (1e-04, 1.44e-04) | 4.6e-07 |
D | _{2} | 0.001 | (9e-04, 1.2e-03) | 4e-06 |
D | _{12} | 2.5e-04 | (2.1e-04, 3e-04) | 1.1e-06 |
LT | | 6.570 | (1.240, 12.587) | 0.117 |
LT | _{1} | 1.283 | (0.776, 2.003) | 0.017 |
LT | _{2} | 1.391 | (0.740, 2.227) | 0.018 |
LT | _{3} | 0.488 | (−0.453, 1.576) | 0.021 |
LT | _{1} | 1.2e-04 | (1e-04, 1.41e-04) | 3.2e-07 |
LT | _{2} | 0.001 | (8.5e-04, 1.1e-03) | 3.4e-06 |
LT | _{12} | 2.55e-04 | (2.1e-04, 3e-04) | 8.4e-07 |