Many modern datasets are sampled with error from complex high-dimensional surfaces. Methods such as tensor product splines or Gaussian processes are effective and well suited for characterizing a surface in two or three dimensions, but they can break down when representing higher dimensional surfaces. Motivated by high throughput toxicity testing, where observed dose-response curves are cross sections of a surface defined by a chemical’s structural properties, a model is developed to characterize this surface and to predict untested chemicals’ dose-responses. This manuscript proposes a novel approach that models the multidimensional surface as a sum of learned basis functions formed as tensor products of lower dimensional functions, which are themselves representable by a basis expansion learned from the data. The model is described and a Gibbs sampling algorithm is proposed. The approach is investigated in a simulation study and through data taken from the US EPA’s ToxCast high throughput toxicity testing platform.

Chemical toxicity testing is vital in determining the public health hazards posed by chemical exposures. However, the number of chemicals far outweighs the resources available to adequately test them all, which leaves knowledge gaps when protecting public health. For example, there are over 80,000 chemicals in industrial use, with fewer than 600 subject to long-term testing.

As an alternative to long-term studies, there has been an increased focus on the use of high throughput bioassays to determine the toxicity of a given chemical. To such an end, the US EPA developed the ToxCast chemical prioritization project.

There is a large literature on estimating chemical toxicity from SAR information; these approaches are termed Quantitative Structure-Activity Relationships (QSAR).

The only QSAR approach that has addressed the problem of estimating a dose-response curve is the work by

Assume that one obtains a noisy dose-response curve for each chemical, along with a chemical information vector x_{i}.

One may use a Gaussian process (GP) to model this surface directly; however, GPs become computationally demanding as the number of observations and input dimensions grow.

As an alternative to GPs, one can use tensor product splines; however, this requires defining a basis over ℝ^{P+Q}, which is often computationally intractable. The proposed model sidesteps these issues by defining a tensor product of two surfaces defined on the lower dimensional spaces ℝ^{P} and ℝ^{Q}.

Rather than focus on nonparametric surface estimation using GPs or tensor product splines, one could consider the problem from a functional data perspective.

Clustering the functional responses is also a possibility. Here, one would model the surface using

The proposed approach creates a new basis, and the number of basis functions may impact the model’s ability to represent an arbitrary surface; to model complicated surfaces, a large number of basis functions is included in the model. Parsimony in this set is ensured by adapting the number of contributing components using the multiplicative gamma prior.
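To make the shrinkage mechanism concrete, the following sketch simulates component variances under a multiplicative gamma process. This is an illustration only, not the paper’s implementation; the function name and defaults are hypothetical, with a_{1} = 2 as used later in the manuscript.

```python
import numpy as np

def multiplicative_gamma_variances(K, a1=2.0, seed=None):
    """Simulate component variances under a multiplicative gamma process:
    delta_h ~ Ga(a1, 1), and the precision of component k is
    tau_k = prod_{h<=k} delta_h, so its variance is 1 / tau_k."""
    rng = np.random.default_rng(seed)
    delta = rng.gamma(shape=a1, scale=1.0, size=K)
    tau = np.cumprod(delta)   # precisions drift upward when a1 > 1
    return 1.0 / tau          # variances typically shrink with k

# With a1 > 1, later components receive heavy shrinkage and thus
# contribute little to the fitted surface.
variances = multiplicative_gamma_variances(K=15, a1=2.0, seed=0)
```

In the model, these variances scale the coefficients of the k-th tensor product, effectively switching off components the data do not support.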

An alternative way to view the proposed approach is as an ensemble learner.

In what follows, Section 2 defines the model. Section 3 gives the data model for normal responses and outlines a sampling algorithm. Section 4 shows through a simulation study that the method outperforms several traditional machine-learning approaches, including treed Gaussian processes, bagged neural networks, and bagged multivariate adaptive regression splines (MARS).

Consider modeling the surface as a basis expansion of the form Σ_{j} Σ_{l} θ_{jl} b_{j} γ_{l}, where the θ_{jl} are coefficients and the b_{j} and γ_{l} are basis functions defined over the two sets of inputs.

Where tensor product spline models define the basis a priori, the proposed approach learns the basis from the data; for chemical i, the inputs are x_{i} = (x_{i1}, …, x_{iK}).

To model the surface, it is written as a sum of tensor products f_{1} ⊗ g_{1}, …, f_{K} ⊗ g_{K}, where each f_{k} and g_{k} is an unknown lower dimensional function learned from the data.

For the tensor product, the functions f_{k} and g_{k} are given Gaussian process priors.
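As a numerical illustration of why a small number of tensor products can represent a smooth surface, the sketch below approximates a two-dimensional surface on a grid by a rank-K sum of outer products. A truncated SVD supplies the one-dimensional components here; in the model itself the analogous f_{k} and g_{k} are learned under GP priors rather than computed from the full surface, so this is a stand-in, not the paper’s method.

```python
import numpy as np

# A smooth surface evaluated on a grid; the axes are placeholders for
# a "dose" coordinate d and a one-dimensional "chemical" coordinate x.
d = np.linspace(0.0, 1.0, 60)
x = np.linspace(0.0, 1.0, 80)
F = np.exp(-(d[:, None] - x[None, :]) ** 2 / 0.1) + 0.5 * np.outer(d, x)

# Truncated SVD: F ~ sum_{k<K} s_k u_k v_k^T, i.e., a sum of K tensor
# products of one-dimensional functions (columns of U, rows of Vt).
U, s, Vt = np.linalg.svd(F, full_matrices=False)

def rank_k_surface(K):
    """Sum of the leading K tensor products."""
    return sum(s[k] * np.outer(U[:, k], Vt[k]) for k in range(K))

err3 = np.linalg.norm(F - rank_k_surface(3)) / np.linalg.norm(F)
err6 = np.linalg.norm(F - rank_k_surface(6)) / np.linalg.norm(F)
# For smooth surfaces the error decays quickly in K, which is why
# shrinking unneeded components costs little flexibility.
```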

One may mistake this approach for a GP defined by the tensor product of covariance kernels (i.e., a separable covariance structure); it is not, because a sum of products of GPs is no longer a Gaussian process.

The value of K sets the maximum number of tensor products f_{k}g_{k}, and a multiplicative gamma prior is placed over their variances: with δ_{h} ∼ Ga(a_{1}, 1) and a_{1} > 1, the variances are stochastically decreasing, favoring more shrinkage as k increases.

The choice of a_{1} defines the level of shrinkage. If a_{1} is too large, the model will have too few components contributing to the sum, and if it is too small no shrinkage will take place. In practice, inference from the multiplicative gamma process is robust to the choice of a_{1}, and nearly identical inference was obtained for 1.5 ⩽ a_{1} ⩽ 5 in the data example. A value of a_{1} = 2 is reasonable for many applications, and it is used in what follows.

Though GPs are used in the model specification, one may instead use polynomial spline models or process-convolution approaches to represent the component functions.

If the functions _{k}_{k}_{1}(_{k}_{k}_{1}(_{K}

When

A data model is outlined for normal errors. Extensions to other data generating mechanisms from this framework are straightforward; for example, extensions to count or binomial data are possible using the Pólya–Gamma augmentation scheme.
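For background, the Pólya–Gamma augmentation rests on the following identity (stated in its standard form; here p(ω) denotes the density of a PG(b, 0) random variable):

```latex
\[
\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}}
  = 2^{-b}\, e^{\kappa \psi}
    \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega,
  \qquad \kappa = a - \tfrac{b}{2}.
\]
```

Conditional on the latent ω, a binomial likelihood in ψ is proportional to a Gaussian kernel, so the normal conjugate updates of the surface carry over directly to a Gibbs sampler.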

For the data model, assume that for cross section i the observations are conditionally independent and normal, y_{ic} ∼ N(μ_{ic}, τ^{−1}), where μ_{ic} is the surface evaluated at the corresponding inputs.

In defining _{k}_{k}_{k}_{k}_{k}_{k}_{0}, let

Given these choices, the data model is normal with mean given by the sum of tensor products f_{k} ⊗ g_{k} evaluated at the observation’s inputs.

This approach was tested on synthetic data. The dimension of

To create a dataset, the chemical information vector x_{i} was simulated, and responses were generated by evaluating the true surface at the simulated inputs and adding noise.

In specifying the model, priors were placed over the parameters of f_{k} and g_{k} that reflect assumptions about each curve’s smoothness.

A total of 12,000 MCMC samples were taken, with the first 2,000 discarded as burn-in. For storage purposes, every tenth sample was saved. Trace plots from multiple chains were monitored for convergence, which occurred before 500 iterations.
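The burn-in and thinning scheme described above amounts to simple index bookkeeping; a minimal sketch (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def retained_indices(n_total, burn_in, thin):
    """Indices of MCMC draws kept for inference: discard the first
    `burn_in` iterations, then keep every `thin`-th draw."""
    return np.arange(burn_in, n_total, thin)

# 12,000 total draws, 2,000 burn-in, thinning by 10 -> 1,000 retained.
idx = retained_indices(n_total=12000, burn_in=2000, thin=10)
```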

To analyze the choice of K, the model was fit with K = 1, 2, 3, and 15 tensor product components.

For each dataset,

To investigate the model’s performance in a situation comparable to the data example, an additional simulation was conducted. Here 50 datasets were constructed where

The approach is applied to data released from Phase II of the ToxCast high throughput platform. The AttaGene PXR assay was chosen as it has the highest number of active dose-response curves across the chemicals tested. This assay targets the pregnane X receptor, which detects the presence of foreign substances and upregulates proteins involved in detoxification in the body. An increased response for this assay might relate to the relative toxicity of a chemical.

Chemical descriptors were calculated using Mold^{2}, which computes 777 unique chemical descriptors. A principal component analysis was performed on the descriptors across all chemicals; this is a standard dimension-reduction technique in the QSAR literature, and the leading principal component scores define the chemical information vector x_{i}.
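This reduction step can be sketched as follows. The matrix sizes and component count below are placeholders (the actual data are 969 chemicals by 777 Mold^{2} descriptors), and a centered PCA via SVD stands in for whatever implementation was used.

```python
import numpy as np

def pca_scores(X, n_components):
    """Centered PCA via SVD: returns the scores on the leading
    principal components, used as reduced chemical descriptors."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Hypothetical stand-in for the chemicals-by-descriptors matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
Z = pca_scores(X, n_components=5)   # one reduced vector x_i per row
```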

The database was restricted to 969 chemicals having SMILES information available. In the assay, each chemical was tested across a range of doses between 0

A random sample of 669 chemicals was used to train the model, and the remaining 300 chemicals were used as a hold-out sample. In this analysis, K = 15 tensor products were allowed; the variance of component 15 was monitored and remained less than 0.02, indicating additional tensor products were not needed.

To compare the prediction results, boosted MARS and neural networks were used; treed Gaussian processes were attempted, but the R package ‘tgp’ crashed after 8 hours during burn-in. The method of

In comparison to the other models, the adaptive tensor product approach also had the lowest predicted mean squared error and predicted mean absolute error for the data in the hold-out sample: a predicted mean squared error of 342.1 and a mean absolute error of 11.7, compared with 354.7 and 12.4 for neural networks and 383.6 and 13.4 for MARS. These results are in line with the simulation.
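For reference, the two hold-out summaries are standard; a minimal computation with placeholder vectors:

```python
import numpy as np

def prediction_errors(y_true, y_pred):
    """Mean squared and mean absolute prediction error on a hold-out set."""
    resid = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(resid ** 2)), float(np.mean(np.abs(resid)))

# Toy example: residuals are (-0.5, 0.0, 1.0).
mse, mae = prediction_errors([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```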

One can also compare the ability of the posterior predictive data distribution to predict the observations in the hold out sample. To do this, lower and upper tail cut-points defined by

The proposed approach allows one to model higher dimensional surfaces as a sum of a learned basis, where the effective number of components in the basis adapts to the surface’s complexity. In the simulation and motivating problem, this method is shown to be superior to competing approaches, and, given the design of the experiment, it is shown to require fewer computational resources than GP approaches. Though this approach is demonstrated for high throughput data, it is anticipated that it can be applied to any multi-dimensional surface.

In terms of the application, this model shows that dose-response curves can be estimated from chemical SAR information, which is a step forward in QSAR modeling. Though such an advance is useful for investigating toxic effects, it can also be used to investigate therapeutic effects. It is conceivable that such an approach can be used

Future research may focus on extending this model to multi-output functional responses. For example, multiple dose-responses may be observed, and, as they target similar pathways, are correlated. In such cases, it may be reasonable to assume their responses are both correlated to each other and related to the secondary input, which is the chemical used in the bioassay. Such an approach may allow for lower level

The author thanks Kelly Moran, Drs. A. John Bailer and Eugene Demcheck, the associate editor, and three anonymous referees for their comments on earlier versions of this manuscript.

7. Supplementary materials

Web appendices and computer code referenced in Sections 3 and 4 are available with this paper at the Biometrics website on Wiley Online Library.

Mold^{2}: molecular descriptors from 2D structures for chemoinformatics and toxicoinformatics.

Example of the problem for a 2-dimensional surface. Two 1-dimensional cross sections are observed (black lines) from the larger 2-dimensional surface.

Comparison of the predictive performance between the adaptive tensor product and the treed Gaussian process. In the figure, the corresponding model’s root mean squared predicted error is given as a contour plot. The heat map represents the maximum dose response given the coordinate pair; lighter colors represent greater dose-response activity.

Four posterior predicted dose-response curves (black line) with corresponding 90% equal tail quantiles from the posterior predictive data distribution (dotted lines) for four chemicals in the hold out samples having repeated measurements per dose. Grey dash-dotted line represents the predicted response from the bagged neural network.

Mean squared prediction error in the simulation of the adaptive tensor product approach for four values of K, as well as treed Gaussian processes, bagged neural networks, and bagged multivariate adaptive regression splines (MARS).

| Surface | N | Adaptive TP (K=1) | Adaptive TP (K=2) | Adaptive TP (K=3) | Adaptive TP (K=15) | Neural Net | MARS | Treed GP |
|---|---|---|---|---|---|---|---|---|
| 2-dimensions | N=75 | 76.2 | 69.1 | 69.5 | 69.8 | 108.3 | 215.2^{1} | 839.7^{1} |
| | N=125 | 56.9 | 48.5 | 48.8 | 48.7 | 92.4 | 205.8 | 158.0^{1} |
| | N=175 | 48.5 | 37.7 | 38.4 | 38.3 | 85.1 | 198.8 | 61.1 |
| 3-dimensions | N=75 | 164.9 | 162.0 | 155.4 | 155.4 | 185.2 | 246.6^{1} | 1521.5^{1} |
| | N=125 | 128.6 | 125.0 | 121.0 | 121.0 | 160.4 | 223.4 | 421.3^{1} |
| | N=175 | 106.3 | 102.6 | 99.7 | 100.1 | 150.0 | 217.7 | 163.5^{1} |

^{1} Trimmed mean used with 5% of the upper and lower tails removed.
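The trimmed mean referenced in the table footnote can be computed as follows (a sketch, assuming simple index truncation after sorting; the function name is illustrative):

```python
import numpy as np

def trimmed_mean(values, prop=0.05):
    """Mean after removing the lowest and highest `prop` fraction of
    the sorted values (5% from each tail by default)."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(np.floor(prop * v.size))
    return float(v[k:v.size - k].mean())

# 21 values: floor(0.05 * 21) = 1 point dropped from each tail,
# removing the outlier 1000.0 and the minimum 0.0.
x = np.concatenate([np.arange(20.0), [1000.0]])
tm = trimmed_mean(x)
```

Trimming guards the reported averages against the occasional unstable fit that inflates the untrimmed mean squared error.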