Reverse engineering gene networks and identifying regulatory interactions are integral to understanding cellular decision making processes. Advancement in high throughput experimental techniques has initiated innovative data driven analysis of gene regulatory networks. However, inherent noise associated with biological systems requires numerous experimental replicates for reliable conclusions. Furthermore, evidence of robust algorithms directly exploiting basic biological traits are few. Such algorithms are expected to be efficient in their performance and robust in their prediction.
We have developed a network identification algorithm to accurately infer both the topology and strength of regulatory interactions from time series gene expression data in the presence of significant experimental noise and non-linear behavior. In this novel formulism, we have addressed data variability in biological systems by integrating network identification with the bootstrap resampling technique, hence predicting robust interactions from limited experimental replicates subjected to noise. Furthermore, we have incorporated non-linearity in gene dynamics using the S-system formulation. The basic network identification formulation exploits the trait of sparsity of biological interactions. Towards that, the identification algorithm is formulated as an integer-programming problem by introducing binary variables for each network component. The objective function is targeted to minimize the network connections subjected to the constraint of maximal agreement between the experimental and predicted gene dynamics. The developed algorithm is validated using both
Our integer programming algorithm effectively utilizes bootstrapping to identify robust gene regulatory networks from noisy, non-linear time-series gene expression data. With significant noise and non-linearities being inherent to biological systems, the present formulism, with the incorporation of network sparsity, is extremely relevant to gene regulatory networks, and while the formulation has been validated against
The progress in the field of experimental techniques in systems biology in recent years has contributed significantly to the analysis and understanding of gene regulatory networks [
A variety of modeling approaches have been developed recently for inferring genetic networks from gene expression data. Identification algorithms are dependent on how the network is modeled [
In the current report, we model the dynamics of gene expression by S-system formulation. Upon doing so, we formulate the network identification algorithm as a bi-level optimization problem, governed by the hypothesis of network sparsity. Network sparsity has been experimentally observed in various biological systems such as the visual system of primates [
Pseudo-code of the robust network identification algorithm implementation.
The performance of the developed bi-level integer programming algorithm is demonstrated on three case studies. In the first case study, we consider
The purpose of this case study is to validate the algorithm on a small network with and without experimental noise. The chosen 5-gene network model [
Using the S-system formulation, the 5-gene network model can be represented by the system of five coupled nonlinear ode, shown in Additional file
Figure
The performance of the algorithm is next analyzed in the presence of experimental noise, generated by adding 5% Gaussian noise to the time-course data generated from equation (1) shown in Additional file
Figure
The expected values of the S-system parameters estimated at 90% confidence level are represented in Table
Comparison of the identified S-system reaction order values to actual values
| G1G3 | 1 | 1.2 ± 0.09 |
| G1G5 | -1 | -1.0 ± 0.03 |
| G2G1 | 2 | 2.4 ± 0.04 |
| G2G3 | NA | -3.4 ± 0.06 |
| G2G4 | NA | 3.9 ± 0.05 |
| G3G2 | -1 | -1.1 ± 0.03 |
| G4G3 | 2 | 1.9 ± 0.02 |
| G4G5 | -1 | -1.0 ± 0.01 |
| G5G4 | 2 | 2.0 ± 0.02 |
Comparison of the identified S-system rate constant values to actual values
| X1 | 5 | 3.8 ± 0.2 | 10 | 18.0 ± 0.8 |
| X2 | 10 | 13.8 ± 0.9 | 10 | 16.2 ± 0.2 |
| X3 | 10 | 13.8 ± 0.2 | 10 | 11.2 ± 0.23 |
| X4 | 8 | 8.1 ± 0.1 | 10 | 11.8 ± 0.1 |
| X5 | 10 | 10.3 ± 0.05 | 10 | 8.9 ± 0.03 |
To evaluate the accuracy of the formulism under increased uncertainty, the algorithm was tested under various amounts of added noise. As one would expect, the accuracy of the algorithm depends on the level of noise added to the
Effect of added noise on the network identification results
| 5 | 1 | 0.78 |
| 7 | 0.57 | 1 |
| 10 | 0.57 | 1 |
White Gaussian noise was added at different amounts to the five-gene network
While the analysis is performed on 1000 bootstrap samples, it is computationally expensive to solve 1000 network identification problems. Hence, we investigated the sensitivity of the identified robust network on the number of bootstrap samples by considering a broad range of samples from 200 to 1000. Figure
Convergence study on network identification results using bootstrapping with 5% noise.
To assess the necessity of this bootstrapping technique, the aforementioned results were compared to a control group which did not utilize bootstrapping. To do this, a more deterministic approach was employed. Experimental replicates were generated as detailed: 10% white Gaussian noise was added to the 5-gene
In this example we investigate the performance of the developed algorithm in a larger network consisting of ten genes, as depicted in equation (2) shown in Additional file 1. For the deterministic case study the tolerance was specified at a low value of 10-5. Because the 10-gene network increases the number of binary variables in the upper level to 100, more GA generations are needed to obtain a converged solution; therefore, the number of generations was increased to 1000. The identified connections and kinetic parameters are shown in Figure
The proposed algorithm is next applied to the SOS DNA repair system of
Identification of this 6 gene network will require 36 binary variables; hence the GA parameters were retained similar to our first case study presented earlier: 20 populations evolved through 200 generations. The error tolerance, however, had to be relaxed to a higher value of .7 because of noise inherent in the experimental data set. Figure
In the next step the robust connections of the identified network are further analyzed by bootstrapping the experimental data set. Since our previous analysis on the first case study demonstrated 200 bootstrap samples to be adequate, in this example we generated 300 artificial data sets from the original experimental repeats. The network identification algorithm was solved at each of the data sets to generate 300 alternate networks. The frequency of occurrence of each network connection is analyzed over the array of alternate network and connections appearing with over 45% frequency are considered to be robust. Figure
| G1G2 | 0.9 ± 0.04 |
| G1G3 | 1.4 ± 0.03 |
| G1G4 | 0.9 ± 0.04 |
| G2G3 | 0.9 ± 0.04 |
| G2G4 | 0.9 ± 0.07 |
| G2G5 | 0.8 ± 0.08 |
| G3G4 | 1.0 ± 0.03 |
| G3G5 | 0.9 ± 0.03 |
| G4G4 | 1.3 ± 0.07 |
| G4G5 | -0.7 ± 0.05 |
| G5G1 | 1.0 ± 0.14 |
| X1 | 3.2 ± 0.28 | 8.4 ± 0.12 |
| X2 | 1.5 ± 0.09 | 1.6 ± 0.19 |
| X3 | 1.7 ± 0.21 | 1.5 ± 0.17 |
| X4 | 5.3 ± 0.16 | 1.6 ± 0.07 |
| X5 | 1.6 ± 0.16 | 2.0 ± 0.15 |
| X6 | 4.3 ± 0.22 | 3.8 ± 0.12 |
In this work, we present an algorithm to identify robust regulatory networks from time profiles of gene expression data. Our identification algorithm is primarily developed on the hypothesis of sparsity of biological network connections. In our earlier work we established the validity of the hypothesis of sparsity using a simplified linear ode representation of gene expression dynamics in a deterministic system. Herein we further advance the algorithm by incorporating more realistic non-linear representation using an S-system formulation of gene expression dynamics. The identification algorithm is formulated as a bi-level optimization problem in which the upper level solves an integer programming problem while the lower level is a continuous parameter identification problem. Furthermore, we propose a framework to incorporate noisy experimental data towards identification of a robust regulatory network. This is done by first generating artificial experimental repeats using the bootstrapping technique, followed by solving the identification formulation at each of the bootstrap data sets. From this library of identified prospective networks we isolate the most-repeated network connections which we hypothesize to be a robust connection, having low variability to experimental noise.
The upper level integer programming problem is solved using GA. There are several advantages of using GA to solve the above problem, the most important being that it does not require gradient evaluation. This is a significant advantage for the above problem with non-linear ode as constraint function. In addition, GA starts its search not from a single point in the feasible parameter space, but from multiple locations specified in the starting population. Hence, it holds the chance of converging at global minima, although such convergence cannot be guaranteed with GA. However, it also suffers from the disadvantage of increased computational cost. All the computations reported here have been carried out on 2.66 Ghz processer and 16 GB RAM server. The computational time for the five gene network without noise was 1 hour and the same network with noise was 2.5 hours. The computational time for the experimental data was 3 hours. For the 10 gene network, the genetic algorithm needed more generations to converge, resulting in computational time of 11 hours. Hence, extension of the current solution procedure to a much larger data set will be expensive. While the same formulation will still be applicable in a larger system, alternate solution procedures are currently being investigated for its extension to larger networks.
In the formulation presented in equation (2), the only user defined parameter is the value of the tolerance which dictates how closely the model prediction must agree with experimental dynamics in order for the network to be considered in the overall algorithm. While for an
Effect of error constraint on 5-gene network identification, 5% noise
| 0.13 | 0.78 | 1.0 | 9 |
| 0.20 | 0.88 | 0.88 | 7 |
| 0.25 | 0.78 | 0.78 | 7 |
| 0.30 | 0.88 | 0.7 | 5 |
| 0.35 | 0.88 | 0.7 | 5 |
The performance of the developed robust identification formulation is illustrated using three different systems. The first two case studies are based on
The current approach offers an improvement on existing algorithms. Numerous studies have used the 5-gene network (the current case study I) to test the accuracy and efficiency of their network identification methods. A comparison between the methods is presented by Kimura
These results show that our bi-level integer optimization algorithm is able to effectively identify the topology and connection strength of gene regulatory networks, even when the gene dynamics are non-linear and noisy in nature. By using the biological trait of sparsity, the algorithm optimizes the number of connections in the network while maintaining agreement in gene temporal profiles with the experimental input data. Even with uncertainty and noise in the data, something which is unavoidable on an experimental level, our bootstrapping/identification combination was able to identify a robust network. While we have demonstrated the effectiveness of our algorithm on
Identification of the regulatory network from time series gene expression data first requires modeling the dynamic evolution of the individual genes constituting the network. Here we model gene dynamics as a set of coupled non-linear ode following the S-system formulation, which captures the non-linearity in gene expression profiles using a power-law kinetic representation.
For a system with N-genes, the S-system model can be represented using equation (1):
Where
Our network identification algorithm is primarily based on the hypothesis of sparsity of network connections governing biological systems. Hence our overall objective is to determine the sparsest network which can satisfactorily capture the observed network dynamics. Following this idea, the network identification problem is formulated as an optimization problem with the objective of promoting sparsity given the constraint of maximizing predictive capacity. Such problem definition results in a bi-level optimization problem, where the constraint itself is an unconstrained optimization problem. In the current formulation using S-system to model the gene expression level (equation (1)), the kinetic orders (
In the above formulation
GA is typically designed to handle unconstrained optimization problems. One technique for constraint handling in GA is by penalty function, where the constraint is conditionally incorporated in the objective function. For conditions violating the constraint the objective function is penalized, and not so otherwise. In the current formulation the constraint is incorporated in the objective function using the following modification of the objective function:
where
A significant advantage of the bi-level formulation is that it allows optimum utilization of experimental data by sequentially reducing the number of unknown parameters in the lower level. In a conventional least-square parameter estimation problem, the connectivity is fixed and includes all possible network connections. Therefore, the size of the identifiable system is restricted, governed by the availability of experimental data points so that number of unknown parameters is less than the number of data points. For instance, a single level algorithm, using the above S-System formulation, would be restricted to less than m-3 genes. However, in the current bi-level formulation, this restriction is relaxed. Because the number of network connections are first reduced in the upper level, the number of genes to be analyzed is not so restricted, with the only constraint coming from the connectivity:
Hence the constraint is imposed on the maximum number of binary variables assigned in the upper level, but does not constrain the total size of the analyzed network. Moreover, our primary objective being sparsity of network connections, the formulation essentially tries to minimize the number of connections assigned to 1. Hence, except for the very initial phase of GA evolution, the constraint defined in equation (2) typically does not become active, and never so in the final optimal solution.
Real world data typically contains noise due to experimental uncertainty and system stochasticity. Biological data are particularly notorious for its inherent heterogeneity and stochasticity [
An alternative to actual experimental repeats is to use bootstrapping. The purpose of this statistical technique is to estimate the distribution of the estimator around the unknown true value
In our algorithm, we are dealing with limited experimental data. Hence, following the above methodology, we generate a large artificial data set by repeated resampling of the limited experimental repeats. Once the bootstrapped samples are obtained, the network identification algorithm previously described is applied to all bootstrap data sets to identify a network corresponding to each. The network sets thus obtained is further analyzed to determine the frequency of occurrence of each connection in the entire set of identified networks. We hypothesize that frequent occurrence of network connections in the bootstrap samples indicate the insensitivity of the corresponding network to experimental noise, and hence claim that connection to be robust.
In order to quantify the quality of prediction of the proposed algorithm the measures of
Where: TP (True Positive) denotes the number of connections correctly captured; FN (False Negative) denotes existing connections which are not captured in the identified network; and FP (False Positive) denotes connections which are incorrectly captured in the identified network. Following the above equation: a low value of recall would indicate a more conservative estimate which is unable to capture many of the existing connections; a low value of precision will indicate prediction of incorrect connections not appearing in the actual network; and a value of 1 will indicate perfect network identification. The flow diagram of the overall network identification algorithm is shown in Figure
The authors declare that they have no competing interests.
NC, KT and IB developed the algorithm. NC and KT analyzed the data and performed output analysis. NC, KT, and IB drafted the manuscript. IB conceived of the study, and participated in its design and coordination. All authors read and approved the final manuscript.
Click here for file
This work was financed by NIH (DP2-16520).