Wrote the paper: VAE GP BMG. Conceived and designed computational experiments, performed computational experiments, analyzed the data with LibSELDI and manually validated peaks: VAE. Conceived and designed laboratory experiments: GP ERU. Performed laboratory experiments, analyzed the data with Ciphergen Express and manually validated peaks: GP. Provided biological interpretation of the data: BMG. Provided statistical interpretation of the clinical data: J-MSL. Revised manuscript critically for important intellectual content: J-MSL ERU.
SELDI-TOF mass spectrometer's compact size and automated, high throughput design have been attractive to clinical researchers, and the platform has seen steady-use in biomarker studies. Despite new algorithms and preprocessing pipelines that have been developed to address reproducibility issues, visual inspection of the results of SELDI spectra preprocessing by the best algorithms still shows miscalled peaks and systematic sources of error. This suggests that there continues to be problems with SELDI preprocessing. In this work, we study the preprocessing of SELDI in detail and introduce improvements. While many algorithms, including the vendor supplied software, can identify peak clusters of specific mass (or m/z) in groups of spectra with high specificity and low false discover rate (FDR), the algorithms tend to underperform estimating the exact prevalence and intensity of peaks in those clusters. Thus group differences that at first appear very strong are shown, after careful and laborious hand inspection of the spectra, to be less than significant. Here we introduce a wavelet/neural network based algorithm which mimics what a team of expert, human users would call for peaks in each of several hundred spectra in a typical SELDI clinical study. The wavelet denoising part of the algorithm optimally smoothes the signal in each spectrum according to an improved suite of signal processing algorithms previously reported (the LibSELDI toolbox under development). The neural network part of the algorithm combines those results with the raw signal and a training dataset of expertly called peaks, to call peaks in a test set of spectra with approximately 95% accuracy. The new method was applied to data collected from a study of cervical mucus for the early detection of cervical cancer in HPV infected women. The method shows promise in addressing the ongoing SELDI reproducibility issues.
The data analysis pipeline following a SELDI study involves 1) preprocessing to produce quantified peak clusters, 2) manually validating peak clusters as a QC step, and 3) group analysis to find differences between cases and controls. The methodology for preprocessing SELDI involves multiple algorithmic steps, and has been reviewed in
In order to facilitate more accurate SELDI studies with larger sample sizes, we introduce a neural network model to improve the automation of the validation step along with major improvements to the LibSELDI preprocessing approach. The neural network is trained on approximately 4200 expert annotated peaks. In this way, the neural network mimics the validation behavior of our in-house scientists in a more automated and objective fashion. The algorithm improvements to LibSELDI include 1) a 650× speed up of the algorithm, 2) improved denoising to reduce artifacts, and 3) quantitation. These algorithm improvements are demonstrated on a pooled-sample dataset. Finally, the improved LibSELDI is combined with the neural network and tested on a pilot clinical dataset consisting of samples from two different stages of cervical neoplasia. We compare the results of the LibSELDI/neural network approach to the standard Ciphergen Express analytical software on both the QC samples and the clinical samples.
This research was approved by the Centers for Disease Control and Prevention's Institutional Review Board. Informed consent was obtained in writing from participants in the study.
Cervical mucous was collected from women enrolled as part of an ongoing study of cervical neoplasia
The Protein Biological System II-c™ mass spectrometer, with Protein Chip software (version 3.2) (Ciphergen Biosystems, Fremont, CA) was used to perform SELDI-TOF MS as described previously
CE was used to preprocess the spectra following a modification of the standard operating procedure that has been developed in house and previously described
Orthogonal wavelet transforms, while having excellent denoising properties in the mean-squared error sense, can sometimes produce artifacts. These artifacts appear in the data as localized ringing in the vicinity of high frequency components/discontinuities (the pseudo-Gibbs effect) and reconstruction errors containing imprints of the particular wavelet basis used with the transform. To address these issues, Coifman and Donoho introduced the concept of cycle-spinning
We designed and implemented a zero-phase, finite-impulse response (FIR) filter for LibSELDI (LS) to prepare processed spectra for quantification using peak heights or peak areas. While LS has been shown previously to perform well at resolving the mean m/z of peak clusters in a group of spectra, the denoised output of the modified Antoniadis-Sapatinas algorithm often decreases the peak heights. This effect was noted previously by Besbeas et al.
To automate peak validation, a feed-forward neural network with one hidden layer and sigmoid activation function was built in 4 steps: 1) a large set of manually validated peaks to use for model parameter estimation was created, 2) peaks were divided into training/validation/test sets according to a 50/25/25 percent split, 3) model parameters were estimated,
Note that in (2)–(4),
Let
In a local window
For
Calculate the peak cluster prevalence as
We used a dataset of spectra from 31 pooled cervical mucous QC samples to evaluate the ability of LS and CE to accurately find peak cluster mean m/z values corresponding to reproducible peaks. We define a reproducible peak as one that is present at the same m/z value (within 0.3% mass error tolerance) in 80% or more of the spectra. Two of the authors (VE, GP) visually inspected every reproducible peak predicted by each method adhering to the following protocol:
Size of window or zoom was ±2% of the m/z value of the peak.
Peaks were categorized separately as Confirmed or Rejected for the processed and raw spectra. Agreement was required between authors VE and GP for close calls.
If a peak was confirmed in the processed spectra but rejected in the raw spectra, the final consensus call was “reject”, as the peak could be an artifact introduced after processing.
Criteria for rejection were:
Peaks that were too broad at a given m/z.
Peaks that could not be distinguished from the noise of the surrounding regions.
A cluster is rejected if there were less than 24 spectra with good peaks (prevalence = 24/30 = 80%).
Peak was clearly an artifact from the preprocessing step.
Once all reproducible peaks had a final annotation of confirmed or rejected, summary statistics were calculated to analyze the virtues of each approach. For each approach, we calculated.
In this case, the estimated TP is the number of reproducible peaks predicted that were confirmed by visual inspection, the estimated FN is the number of confirmed peaks that were missed, and the estimated FP is the number of peaks that were rejected as false after visual inspection. In other words, sensitivity is the percentage of reproducible peaks that are confirmed via manual validation, while the false discovery rate is the percentage of reproducible peaks that were rejected after manual validation but predicted by CE or LS.
We tested both CE and LS on a pilot clinical data set containing 16 controls and 8 cases. In addition to estimating mean peak cluster m/z values accurately, clinical data presents the additional challenge of accurately estimating peak cluster prevalence and peak height/area measurements. Fisher-exact tests with mid-P correction were used to test for significant prevalence differences between cases and controls. T-tests and Mann-Whitney U tests were performed to find peak clusters with significant differences. Peak clusters with statistically significant behavior were qualitative reviewed to check for quality of the preprocessing and neural network-based validation results.
All samples were tested in duplicate and provided spectral profiles in the initial run. In the case of 3 samples, duplicate spectra were removed after initial analysis using CE as the normalization factor (NormF) was greater than average NormF+2σ generated with the batch analysis. These samples were rerun and replicate spectra used in further analysis. In the case of LS analysis, duplicate or replicate spectra were averaged to produce a single spectrum representing each subject prior to peak detection, unlike CE. In the CE analysis, duplicates or replicates peak heights were averaged after peak detection but before t-tests/group analysis.
We estimated the quadratic variance function (QVF) from spaces between the peaks of the QC data. Peak-free regions were selected by visual inspection, with the mean and variance at each point calculated across spectra. A quadratic detector response curve is fit to the mean/variance points using least squares, and results of the QVF estimation procedure are shown in
Extending the modified Antoniadis-Sapatinas algorithm to use cycle spinning required decreasing the computational complexity of the algorithm. For the implementation of the mA-S algorithm used in
The slow part of the computation was the default implementation for calculating the term
Many computations were redundant in the 2-D wavelet decomposition used to calculate this term; i.e. the wavelet operator was the same for every spectrum and does not need to be recomputed every time.
The n × n wavelet operator was very sparse, containing approximately
Combining these facts we implemented a sparse matrix-based formulation of the problem to save memory and computation time rather than the standard 2-D filter bank decomposition. The wavelet transform matrix was constructed one time and a sparse-format matrix is stored off line. The sparse-matrix representation uses significantly less memory. Thus, in this implementation, every-time the
| n | ||||||
| Implementation | 210 | 211 | 212 | 213 | 214 | 215 |
| Original | 0.1235s | 0.5066s | 2.0476s | 7.8817s | 32.5274s | 1155.85s |
| Gen-Sparse | 0.0688s | 0.2392s | 0.9250s | 9.1161s | 23.3283s | 97.5260s |
| Offline-Sparse | 0.0089s | 00174s | 0.0407s | 0.0999s | 0.2177s | 1.7856s |
The FIR low-pass filter coefficients were estimated using the Parks-McClellan algorithm in MATLAB (firpm and firpmord functions). The specifications given the algorithm are: normalized frequency transition band between 0.15 and 0.25, pass-band ripple of 0.01, and stop band attenuation of 60 decibels (dB). The non-causal zero-phase implementation of the filter is used to prevent phase delays and preserve the locations of the peaks (filtfilt command in MATLAB)
This is a typical example where the Antoniadis-Sapatinas denoising would find the peak but distort its peak height.
A feed-forward artificial neural network with one hidden layer was designed that validates peaks with (96.5%, 93.4%, 94.7%) accuracy on the training set, validation set, and test set, respectively. Peaks were detected in the QC data using LS with the most liberal settings possible so as to cast a wide net for all possible peak configurations and shapes in the data for the training process. We manually validated each prediction as a confirmed peak, a noisy/false prediction, or indeterminate. All indeterminate predictions were removed from the model estimation process. After removal of indeterminate cases, we had 4256 expert-annotated predictions, containing 57.07% percent “true” (confirmed peak) examples. We then randomly sampled the data into an approximate 50/25/25% split for the training set, validation set, and test set respectively. To construct the corresponding 62-dimensional feature vector
For SELDI, all the information needed to make a judgment about peak validity is contained in a window
When deciding whether a peak is acceptable or not, both the raw and processed spectra carry important information; a good peak shape in the processed spectra is not sufficient by itself if the corresponding raw spectrum is very noisy and of poor quality.
Good peaks seem to look like a healthy concave quadratic centered at
These insights are applied to the following procedure used to construct the feature vector. Let
We used a fully-connected feed-forward neural network with 63 input nodes, 21 hidden layer nodes, and 1 output layer node (including bias nodes). Thus, the parameter matrices had dimensions
LibSELDI showed improved sensitivity and specificity for detecting peak cluster mean m/z values corresponding to reproducible peaks in the pooled-sample QC data.
LS w/neural network validation found 124 peak clusters and resolved peak predictions accurately within clusters and CE with a SOP were applied to the pilot clinical mucous data. An overview of the LS/neural network strategy for analyzing clinical data is shown in
| t-test, peak area | t-test, peak height | U-test, peak area | U-test, peak height | ||||||||
| CIN0 | CIN3 | CIN0 | CIN3 | CIN0 | CIN3 | CIN0 | CIN3 | ||||
| 16054.5 Da | 180 (8.6) | 150 (1.1) | 11821.9 Da | 0.69 (0.12) | 0.27 (0.07) | 6912.8 Da | 52.1 (0.6) | 58.2 (3) | 11821.9 Da | 0.69 (0.1) | 0.27 (0.1) |
| 3017.1 Da | 26.5 (1.6) | 21.9 (0.6) | 16054.5 Da | 0.58 (0.09) | 0.26 (0.003) | 12680.8 Da | 116.8 (3.8) | 136.5 (8.0) | 8287.7 Da | 0.15 (0.03) | 0.07 (0.008) |
| 8287.7 Da | 66.6 (0.9) | 64.0 (0.4) | 3017.1 Da | 0.48 (0.12) | 0.13 (0.03) | 10427.1 Da | 83.3 (1.9) | 88.5 (2.8) | 3682.3 Da | 0.46 (0.06) | 0.89 (0.16) |
| 11821.9 Da | 115.6 (5.1) | 100.4 (3.8) | 3682.3 Da | 0.46 (0.06) | 0.89 (0.16) | 3682.3 Da | 34.2 (1) | 40.3 (2.3) | 12680.8 Da | 0.38 (0.07) | 0.78 (0.15) |
| 2887.8 Da | 24.6 (0.5) | 22.6 (0.6) | 2887.8 Da | 0.32 (0.04) | 0.18 (0.03) | 5647.0 Da | 45.2 (0.6) | 42.7 (0.8) | 6912.8 Da | 0.13 (0.01) | 0.38 (0.13) |
| 5849.7 Da | 49.7 (2.2) | 43.1 (1.0) | 8287.7 Da | 0.15 (0.03) | 0.07 (0.01) | ||||||
| 3682.3 Da | 34.2 (1.0) | 40.3 (2.3) | 5647.0 Da | 0.22 (0.02) | 0.14 (0.02) | ||||||
| 12680.8 Da | 0.38 (0.07) | 0.78 (0.15) | |||||||||
Showing quantification (SEM) for clusters with p-values less than 0.05.
| Fisher-exact tests with mid-P correction | |||
| Cluster | Prevalence, CIN3 | Prevalence, CIN0 | p-value |
| 3017.1 Da | 0.375 | 0.938 | 0.007 |
Showing only clusters with p-values less than 0.05.
With the different parameters used for peak detection, CE detected 106 (stringent) and 168 (non-stringent) peak clusters. Under the stringent parameter (S/N 5/3), 6 peak clusters were found to have differences (p-values less than 0.05, no multiple test correction) (
| Cluster | Average Peak Height (SD) | |
| CIN3 | CIN0 | |
| 21663.4 | 0.843 (0.44) | 0.429 (0.41) |
| 3904.4 | 5.442 (7.75) | 12.201(10.72) |
| 7910.3 | 1.676 (1.36) | 6.00 (5.63) |
| 17205.5 | 0.121 (0.13) | 0.570 (0.55) |
| 8378.3 | 0.747 (0.35) | 1.571 (1.26) |
| 17341.8 | 0.171 (0.21) | 0.57 (0.60) |
Showing only clusters with p-values less than 0.05.
Neural networks were successful in their ability to automate the manual/visual validation step, mimicking the peak-calling performance of our in-house scientists with somewhere between 93%–95% accuracy. While this is very good classification performance for a complex task, we feel that for a true revolution to take place in SELDI preprocessing automation we would require a classifier with classification accuracies greater than 99.9%. After all, at our current accuracy rates, we still expect the neural-network validator to make 1–2 validation mistakes per cluster on our data. We feel strongly that if we could increase our training data by an order of magnitude (from ∼5000 peak examples to ∼50,000), the neural network approach we outlined could achieve such accuracy. With a classification accuracy of 99.9% we would only expect to make a single mistake validating a peak cluster representing a sample size of 1000! Such performance would enable the design of large studies with greater statistical power for making a biological discovery.
A case study in the challenges arising in biomarker discovery is the proteomics literature studying breast cancer. Starting in approximately 2002, breast cancer studies began to appear using the SELDI platform. Over the next several years, many studies followed using different specimens (mostly serum, plasma, or nipple aspirate fluid or NAF), on different groups of patients (early stage breast cancer, post-operative, benign breast cancer, those undergoing surgery, chemotherapy, radiation treatment, or some combination of the above), and some using the closely related MALDI instead of SELDI. Several proteins of interest began to emerge from the studies as being reproducible. Two helpful reviews by Calleson
Through a series of advancements to the different parts of the processing pipeline, LibSELDI has shown great promise for a level of detail in analysis of clinical data that was previously unavailable. The combination of the Antoniadis-Sapatinas algorithm-based denoising with an FIR filter designed for better noise suppression properties than popular Savitsky-Golay filter was a good combination of the strengths of each approach. The A–S algorithm has shown good performance for detecting and estimating peak cluster mean m/z values on simulated, pooled-sample QC, and clinical data. The tendency of the A–S denoising approach to unsatisfactorily alter the peak heights in the denoised spectra is balanced carefully with the FIR-filter based quantification step. We illustrated that the FIR-filtering step on its own would produce too many peak predictions, leading to many false positive peak clusters. By gluing these two methods together we have been able to capitalize on their respective strengths. We have confirmed that SELDI spectra are too inherently bumpy for a single denoising method to be superior at both the peak detection and quantification steps.
The computational tricks that enabled inclusion of the cycle-spinning variant of the modified A–S algorithm were also important, bringing LS a step closer towards enabling the use of SELDI for study designs with large sample size. We showed that a dataset that used to take 16 hours to process can now be processed in under 1.5 minutes. The addition of cycle-spinning reduced the energy in wavelet artifacts present in the denoised spectra, which led to increased sensitivity and specificity of the algorithm when benchmarked on the pooled-sample QC data.
The LibSELDI/neural-network validated peak clusters gave us higher quality predictions, allowing us to resolve individual peak predictions within clusters to a degree of accuracy that gave us fairly accurate measurements of the peak prevalence in each cluster and with respect to the case/control labels in the study. We were able to carry out a series of both parametric and non-parametric analyses based on peak height, peak area, and peak cluster prevalence. These analyses would have been impractical to conduct using the Ciphergen Express software alone. In all, 11 unique clusters came out of the analysis at a significance level below 0.05. Performing the study and analysis on a clinical set with larger sample size is a future work.
The current study contains several limitations that should be noted when interpreting the results. First, we have only shown results on a single sample medium analyzed on a single chip type. While we feel that LibSELDI algorithm extensions and neural-network validation model will extend to other Protein Chips and sample types (e.g. serum, plasma), we have not shown that in this paper. Also, the extension of the neural network to other chip and sample types may require adding significant additional training data to tune the neural network. In general, the baseline removal process has an effect on the quantification of peak heights and peak areas. To the authors' knowledge, there has been no study to date isolating the effect of baseline removal on peak quantitation. This holds for true for LS that we do not have a thorough understanding of the effect of the baseline removal technique used. Lastly, the sample size in our pilot clinical sample is too small to make any real biological conclusions. The small sample size was convenient for performing a preliminary evaluation of the LS and CE methodology on real clinical data. Follow up studies with larger samples sizes will be necessary in order to understand the performance limits of the methodology and to assess potential increases in statistical power for making biological discovery. Lastly, while we have qualitatively observed an improvement in the characterization of peak prevalence for peak clusters in clinical data, there is considerable room in the literature for a future work quantifying the performance of peak cluster composition/prevalence estimation.
The algorithmic and computational improvements in LibSELDI combined with the neural network-based peak cluster validation model moves us one step closer to larger SELDI experiments with greater chance of reproducible biological discovery.